DOCUMENT RESUME

ED 214 944                                              TM 820 026

AUTHOR        Wilcox, Rand R.
TITLE         Test Design Project: Studies in Test Adequacy. Annual
              Report.
INSTITUTION   California Univ., Los Angeles. Center for the Study
              of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, D.C.
PUB DATE      Nov 81
GRANT         NIE-G-80-0112
NOTE          289p.; For related documents see ED 211 592 and ED
              212 650.
EDRS PRICE    MF01/PC12 Plus Postage.
DESCRIPTORS   Achievement Tests; Criterion Referenced Tests;
              Guessing (Tests); *Mathematical Models; *Multiple
              Choice Tests; Scoring Formulas; Testing Problems;
              Test Items; *Test Reliability; Test Theory
IDENTIFIERS   *Answer Until Correct; *Distractors (Tests)

ABSTRACT
These studies in test adequacy focus on two problems:
procedures for estimating reliability, and techniques for identifying
ineffective distractors. Fourteen papers are presented on recent
advances in measuring achievement (a response to Molenaar); "an
extension of the Dirichlet-multinomial model that allows true score
and guessing to be correlated"; results on an answer-until-correct
scoring procedure; the k out of n reliability of a test, and an exact
test for random guessing; "determining the length of multiple choice
criterion-referenced tests when an answer-until-correct scoring
procedure is used"; "a closed sequential procedure for comparing the
binomial distribution to a standard"; "a closed sequential procedure
for answer-until-correct tests"; "approximating the probability of
identifying the most effective treatment for the case of normal
distributions having unknown and unequal variances"; estimating the
reliability of a mastery test with the beta-binomial model;
"analyzing the distractors of multiple choice test items or
partitioning multinomial cell probabilities with respect to a
standard"; "solving measurement problems with an answer-until-correct
procedure"; and "a polarization test for making inferences about the
entropy of multiple-choice test items." (Author/BW)



*********************************************************************
*   Reproductions supplied by EDRS are the best that can be made    *
*                   from the original document.                     *
*********************************************************************



Center for the Study of Evaluation 



UCLA Graduate School of Education 
Los Angeles, California 90024 




U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not
necessarily represent official NIE position or policy.






"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 

J ,C, fearer , 



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."






DELIVERABLE - November 1981

TEST DESIGN PROJECT:
STUDIES IN TEST ADEQUACY

ANNUAL REPORT
Rand Wilcox, Study Director



Grant Number 
NIE-G-80-0112 
I • P-3 



CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education • 
University of California, Los Angeles 






The project presented or reported herein was performed pursuant
to a grant from the National Institute of Education, Department
of Education. However, the opinions expressed herein do not
necessarily reflect the position or policy of the National
Institute of Education, and no official endorsement by the
National Institute of Education should be inferred.



Table of Contents

INTRODUCTION

(A) METHODS AND RECENT ADVANCES IN MEASURING ACHIEVEMENT:
    A RESPONSE TO MOLENAAR

(B) AN EXTENSION OF THE DIRICHLET-MULTINOMIAL MODEL THAT ALLOWS
    TRUE SCORE AND GUESSING TO BE CORRELATED

(C) SOME EMPIRICAL AND THEORETICAL RESULTS ON AN ANSWER-UNTIL-
    CORRECT SCORING PROCEDURE

(D) SOME NEW RESULTS ON AN ANSWER-UNTIL-CORRECT SCORING PROCEDURE

(E) USING RESULTS ON k OUT OF n SYSTEM RELIABILITY TO STUDY AND
    CHARACTERIZE TESTS

(F) BOUNDS ON THE k OUT OF n RELIABILITY OF A TEST, AND AN EXACT
    TEST FOR RANDOM GUESSING

(G) DETERMINING THE LENGTH OF MULTIPLE CHOICE CRITERION-REFERENCED
    TESTS WHEN AN ANSWER-UNTIL-CORRECT SCORING PROCEDURE IS USED

(H) A CLOSED SEQUENTIAL PROCEDURE FOR COMPARING THE BINOMIAL
    DISTRIBUTION TO A STANDARD

(I) A CLOSED SEQUENTIAL PROCEDURE FOR ANSWER-UNTIL-CORRECT TESTS

(J) APPROXIMATING THE PROBABILITY OF IDENTIFYING THE MOST EFFECTIVE
    TREATMENT FOR THE CASE OF NORMAL DISTRIBUTIONS HAVING UNKNOWN
    AND UNEQUAL VARIANCES

(K) A CAUTIONARY NOTE ON ESTIMATING THE RELIABILITY OF A
    MASTERY TEST WITH THE BETA-BINOMIAL MODEL

(L) ANALYZING THE DISTRACTORS OF MULTIPLE-CHOICE TEST ITEMS OR
    PARTITIONING MULTINOMIAL CELL PROBABILITIES WITH RESPECT TO
    A STANDARD

(M) SOLVING MEASUREMENT PROBLEMS WITH AN ANSWER-UNTIL-CORRECT
    PROCEDURE

(N) A POLARIZATION TEST FOR MAKING INFERENCES ABOUT THE ENTROPY OF
    MULTIPLE-CHOICE TEST ITEMS




INTRODUCTION

CSE Studies in Test Adequacy focused on two theoretical problems
during FY1981: 1) procedures for estimating reliability and 2) improved
techniques for identifying ineffective distractors. Applications of these
techniques were also to be demonstrated in the analysis of multiple choice
tests. As any psychometrician will agree, the areas of reliability and
identifying distractors are intimately related. Progress in one area is
likely to influence thought in the other. Although, for the purposes of
this report, the progress of research is divided into two discrete sections,
in fact, selected papers integrate findings in the two areas. In the
August, 1980 plan for the Test Design Project, it was proposed that these
analyses consider data from the study of "Literacy Assessment in a School
District Context." However, at the request of the NIE, the latter study
was deleted from CSE's scope of work; as a result, empirical analyses
trying out newly proposed solutions used available data.

Work in Studies of Test Adequacy proceeded faster than anticipated.
Initial solutions required little revision, and an important new technique
proved very valuable in addressing several test adequacy problems. As a
result, more work than anticipated was completed, and an additional aspect
of reliability, test length, was also addressed, although not required by
the scope of work.

The accomplishments for the year are briefly described below, including
work directly related to each problem area and the extension of the developed
solutions to other contexts.




The methodology used in testing the solutions in the problem areas of
reliability and identifying distractors is a mathematical one. In general,
it depends upon the positing of a "lemma", a mathematical statement presumed
to solve a given problem, and then testing mathematically the quality of the
solution. In the text of this report, as the process of exploring potential
solutions is traced, "advantages" and "limitations" are noted but not fully
described. These terms are used in the mathematical sense and are not matters
of personal preference. Advantages (or limitations) of potential solutions
are demonstrated mathematically in the coordinate, referenced research papers
prepared in this project and are obvious by inspection of the equations.

The estimation of reliability

The problem with estimating the reliability of tests is that the usual
and customary estimation procedures either ignore the problem of guessing
altogether or make clearly inappropriate assumptions about how guessing
affects the data. One frequent assumption is that guessing is completely
random. At the beginning of the year it seemed that existing latent structure
models might provide a solution to the guessing issue. However, the obvious
problem with this solution is that the required model demands mathematical
assumptions that are frequently impossible to meet. Elaboration of this view
is contained in the report entitled "Methods and Recent Advances in Measuring
Achievement." It was decided, therefore, to search for another model, one
which would allow a solution to the guessing issue within more realistic
constraints. A first attempt in this search is described in "An Extension
of the Dirichlet-Multinomial Model that Allows True Score and Guessing to
be Correlated." The new model had theoretical advantages over existing
models, but there was no convincing evidence that it had any practical
advantages, and after considerable review, it was abandoned.

The next attempt was based on an answer-until-correct scoring model.
This solution is described in "Some Empirical and Theoretical Results on
an Answer-Until-Correct Scoring Procedure". All indications are that this new
model substantially improves on existing procedures, both theoretically
and empirically. However, in a very few instances, some of the items used
in the study seemed inconsistent with the assumptions being made. Accordingly,
another empirical study was conducted to see whether an additional model
would "explain" those remaining items. The results of this study are contained
in "Some New Results on an Answer-Until-Correct Scoring Procedure". At the same
time, it was also thought desirable to develop a new reliability coefficient
that reflects the effectiveness of the distractors being used, as an attempt
to integrate the main substantive areas under review. The first step toward
this goal is described in "Using k out of n System Reliability to Study and
Characterize Tests". However, the reasonableness of certain requisite
assumptions was not uniformly stable, and so additional work was undertaken
to find a way of improving this situation. A procedure for doing this is
shown in "Bounds on the k out of n Reliability of a Test, and an Exact Test
for Random Guessing".

In addition, a related concern of reliability is the matter of test
length. Two projects previously funded by NIE include approaches to
criterion-referenced tests, and determining test length. Our new results
have important implications in both these areas, which are described in
"Determining the Length of a Criterion-Referenced Test when an Answer-
Until-Correct Scoring Procedure is Used", in "A Closed Sequential Procedure
for Comparing the Binomial Distribution to a Standard" and in "A Closed
Sequential Procedure for Answer-Until-Correct Tests".

In this general area of reliability, the problem of reliable selection
also occurs, that is, techniques for identifying the t best of k examinees.
An existing procedure is usually impractical because it might require too
many items, a test length issue. A step toward solving this problem is to
develop retrospective methods, and some results on how this might be done
are described in "Approximating the Probability of Identifying the Most
Effective Treatment for the Case of Normal Distributions Having Unknown
and Unequal Variances." Additional materials generated this year are:

- "A Cautionary Note on Estimating the Reliability of a
  Mastery Test with the Beta-Binomial Model"

- "Methods and Recent Advances in Measuring Achievement:
  A Response to Molenaar"

Each of these papers is provided in the following pages.

The identification of distractors

Our original plan for analyzing distractors is described in "Analyzing
the Distractors of Multiple-Choice Test Items or Partitioning Multinomial Cell
Probabilities with Respect to a Standard." However, this approach proved to
be unsatisfactory on several grounds. In particular, it did not give a direct
measure of how effective the distractors really are. One possibility considered
for this particular issue was to analyze how distractors behave in the
context of the answer-until-correct test format. Two procedures were
proposed and described in "Solving Measurement Problems with an Answer-
Until-Correct Scoring Procedure." A problem that remained was determining
whether the assumptions made were reasonable. This was empirically
investigated in "Some Empirical and Theoretical Results on an Answer-Until-
Correct Scoring Procedure", and in "Some New Results on an Answer-Until-Correct
Scoring Procedure."

Next, it was deemed important to consider how distractors might be
analyzed in terms of their relation to the n items on a test. This work
was explicated in "Using k out of n System Reliability to Study and
Characterize Tests" and in "Bounds on the k out of n Reliability of a Test."
Additional work on distractors is described in "A Polarization Test for
Making Inferences About the Entropy of Multiple-Choice Tests", and in
"Analyzing the Distractors of Multiple-Choice Test Items or Partitioning
Multinomial Cell Probabilities with Respect to a Standard."


METHODS AND RECENT ADVANCES IN
MEASURING ACHIEVEMENT:
A RESPONSE TO MOLENAAR

Rand R. Wilcox

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles

and the

DEPARTMENT OF PSYCHOLOGY
University of Southern California



Commenting on a paper I published in this journal (Wilcox, 1979a),
Molenaar (1981) has raised some questions about the usefulness and
feasibility of measuring achievement with latent structure models. In
the last two or three years, considerable progress has been made regarding
the issues mentioned by Molenaar. The purpose of this note is to indicate
the progress that has been made, to describe alternative solutions
that have been recently proposed, and to comment on some of Molenaar's
suggestions on how the model might be improved.



1. INTRODUCTION

Commenting on a paper I published in this journal (Wilcox, 1979a),
Molenaar (1981) has raised some important issues related to measuring
achievement with latent structure models. The purpose of this note is to
briefly outline where we now stand in regard to the concerns expressed
by Molenaar. Before doing so, let me establish some notation, and make
some opening remarks.

Suppose we have a domain of skills and a single examinee. Let $\zeta$ be
the proportion of skills the examinee has acquired. Further suppose that
every skill is represented by one or more items. Let
$\beta = \Pr(\text{correct response} \mid \text{the examinee does not know})$
when the examinee answers an item corresponding
to a randomly sampled skill. Finally, let $\gamma$ be the joint probability of
not knowing and being correct. Thus, $\gamma = \beta(1-\zeta)$.
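
The relation between these three quantities can be illustrated with a small Python sketch. The function name and the returned fields are hypothetical, and the sketch additionally assumes that an examinee who knows a skill always answers correctly, which the text here does not state.

```python
def guessing_probabilities(zeta: float, beta: float) -> dict:
    """Illustrative only: derive gamma and response probabilities.

    zeta : proportion of skills the examinee has acquired
    beta : Pr(correct response | examinee does not know)
    Assumption (not from the source): a known skill is always answered
    correctly, so Pr(correct) = zeta + gamma.
    """
    if not (0.0 <= zeta <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("zeta and beta must lie in [0, 1]")
    gamma = beta * (1.0 - zeta)  # joint probability: not knowing and correct
    return {
        "gamma": gamma,
        "p_correct": zeta + gamma,
        "p_incorrect": (1.0 - zeta) * (1.0 - beta),
    }


if __name__ == "__main__":
    print(guessing_probabilities(0.6, 0.25))
```
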

The above model is based on what I call Type II guessing. It is
important to realize that the latent structure models referenced in my
paper (Wilcox, 1979a, p. 62) are based on Type I guessing. That is,
guessing is defined in terms of a single skill and a population of examinees.
The purpose of the first section of my paper was to show that we can
interchange the role of items and examinees to estimate Type II guessing, which
in turn makes it possible to solve the problems described in sections 2
and 3.



2. PAIRWISE EQUIVALENT ITEMS

The first issue raised by Molenaar is about the notion of equivalent
items. Two items that measure the same skill are said to be equivalent if an
examinee knows both or neither one. Molenaar points out that equivalent items
might exist in some instances but there are situations where the creation
of equivalent items is difficult or even impossible. He is, of course,
correct.

Two aspects of this problem need to be addressed. The first is to
indicate six ways we can empirically check whether two or more items are
equivalent. The second is to briefly comment on four alternative approaches
to the problem of guessing.

The first and perhaps most obvious approach to checking whether items
are equivalent is to apply the usual chi-square goodness-of-fit test to
the latent structure model being used. Macready and Dayton (1977) illustrate
this for a model based on Type I guessing and equivalent items. We
note that a good fit to their data was obtained.

Observe, though, that a poor fit does not necessarily imply that items
are not equivalent. It might mean that a more general model is needed.
For example, we might assume that Pr(incorrect response | examinee knows) > 0
(Macready and Dayton, 1977).

The second approach is to estimate an index that measures equivalence
(Baker and Hubert, 1977).

Another way to check whether items are equivalent is to use latent
partition analysis in the manner proposed by Hartke (1978).

The fourth solution is to first assume that one of the items is
hierarchically related to the second. A test of this assumption is given
by White and Clark (1973). (See, also, Dayton and Macready, 1976.) If
the items are indeed equivalent, one of the parameters in the resulting
latent structure model, say $\delta$, will be zero (Wilcox, 1980a). If we assume
$\delta \ge 0$, a test of the hypothesis that $\delta = 0$ can be made by testing the equality
of two cell probabilities in a 2x2 contingency table. This can be done
with McNemar's test. Some results on the power of McNemar's test are given
in Wilcox (1977a).
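
As an illustration of the last step, the following Python sketch computes an exact two-sided McNemar p-value from the two discordant cell counts of a 2x2 table (correct-incorrect and incorrect-correct). The function name and argument names are hypothetical, and the exact binomial form is used here for simplicity; the source does not say which version of McNemar's test was applied.

```python
from math import comb


def mcnemar_exact(n_correct_incorrect: int, n_incorrect_correct: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts.

    Under the null hypothesis that the two off-diagonal cell probabilities
    are equal, each discordant pair falls in either cell with probability
    1/2, so the smaller count is compared with a Binomial(n, 1/2) tail.
    """
    b, c = n_correct_incorrect, n_incorrect_correct
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence against equality
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    print(mcnemar_exact(3, 12))
```

A small p-value would suggest that the two cell probabilities differ, which under the hierarchical model speaks against the items being equivalent.
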

A fifth check on the assumption of equivalent item pairs can be made
if we assume $\beta$ is bounded above by some constant less than 1. For example,
if $\beta \le 1/2$, then from Wilcox (1979a, p. 64) it follows that two cell
probabilities in a 2x2 contingency table must be less than or equal to 1/4. One
way to check this assumption is described by Wilcox (in press, a).

Finally, assuming $\beta \le 1/2$ also implies that for a randomly sampled
pair of equivalent items, the probability of a correct-incorrect response
(and the probability of an incorrect-correct response) is less than or
equal to the probability of two incorrect responses. This inequality is
easily verified by again referring to Wilcox (1979a, p. 64). Robertson
(1978) describes a test of this assumption.
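
The two inequalities above can be checked numerically under one plausible reading of the equivalent-items model. The sketch below is an assumption, not a transcription of Wilcox (1979a, p. 64): it supposes that an examinee who knows the skill answers both items correctly, and that an examinee who does not know it guesses independently on each item with success probability $\beta$. The function name is hypothetical.

```python
def equivalent_pair_cells(zeta: float, beta: float) -> dict:
    """Cell probabilities of the 2x2 table for one pair of equivalent items.

    Assumed model (illustrative): known skill -> both items correct;
    unknown skill -> two independent guesses, each correct with prob. beta.
    Keys give the outcome on (item 1, item 2), 1 = correct, 0 = incorrect.
    """
    unknown = 1.0 - zeta
    discordant = unknown * beta * (1.0 - beta)
    return {
        "11": zeta + unknown * beta ** 2,
        "10": discordant,
        "01": discordant,
        "00": unknown * (1.0 - beta) ** 2,
    }


if __name__ == "__main__":
    cells = equivalent_pair_cells(zeta=0.4, beta=0.3)
    # With beta <= 1/2 the discordant cells cannot exceed 1/4 or the "00" cell.
    print(cells["10"] <= 0.25, cells["10"] <= cells["00"])
```

Under these assumptions the discordant cells equal $(1-\zeta)\beta(1-\beta)$, which is at most 1/4 and, when $\beta \le 1/2$, at most $(1-\zeta)(1-\beta)^2$.
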

Alternative Approaches to Guessing

Suppose that empirical evidence does not support the assumption of
equivalent items or that we decide a priori that equivalent items do not
exist. In this case we have four alternative approaches to the problem
of guessing. The first is to use completion items. This might eliminate
guessing, but errors at the item level might still exist (Harris et al.,
1980; Macready and Dayton, 1977). Also, in many situations, scoring
completion items is economically infeasible.

If multiple-choice items must be used, the second alternative is to
assume guessing is at random. Lord and Novick (1968, p. 309) note that
this assumption can seldom be seriously entertained. Empirical
investigations on the usual correction-for-guessing formula score support this
view (Cross and Frary, 1977; Bliss, 1980). We might assume guessing is at
random anyway, but this can have serious consequences in terms of the design
and accuracy of a test (Weitzman, 1970; Wilcox, 1980b, 1980c).
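
For reference, the conventional correction-for-guessing formula score mentioned above is usually written as rights minus wrongs divided by one less than the number of alternatives. A minimal Python sketch follows; the function and parameter names are hypothetical, and omitted items are assumed to count as neither right nor wrong.

```python
def formula_score(n_right: int, n_wrong: int, n_alternatives: int) -> float:
    """Conventional correction-for-guessing score: R - W / (t - 1).

    The correction is unbiased only if every wrong answer results from a
    purely random guess among all t alternatives, which is the assumption
    the surrounding text calls into question.
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return n_right - n_wrong / (n_alternatives - 1)


if __name__ == "__main__":
    print(formula_score(n_right=30, n_wrong=8, n_alternatives=5))
```
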

Another approach is to assume hierarchically related items are available.
The resulting model includes equivalent items as a special case (e.g.,
Wilcox, 1980a). Dayton and Macready (1976) describe a general framework
for handling hierarchically related items. For an even more general
model, see Dayton and Macready (1980).

The fourth alternative is to use an answer-until-correct scoring
procedure proposed by Wilcox (1981). (For a related scoring rule, see
Brown, 1965.) Suppose multiple-choice test items are used with $t$
alternatives from which to choose, one of which is correct. An examinee chooses
alternatives until the correct response is identified. Assume the
examinee can eliminate $i$ distractors from consideration when the correct
response is not known, $i = 0, 1, \ldots, t-2$. Following Horst (1933), we also assume
that the examinee guesses at random from among those distractors that are
not eliminated. For a specific examinee let $p_i$ be the probability of
choosing the correct response on the $i$th attempt of a randomly selected
item ($i = 1, \ldots, t$), and let $\zeta_j$ ($j = 0, \ldots, t-2$) be the proportion of items
in the item pool for which the examinee can eliminate $j$ distractors. The
probability of a correct response on the first attempt is

$$p_1 = \zeta + \sum_{j=0}^{t-2} \frac{\zeta_j}{t-j}$$

and

$$p_i = \sum_{j=0}^{t-i} \frac{\zeta_j}{t-j} \qquad (i = 2, \ldots, t).$$

The model assumes

$$p_1 \ge p_2 \ge \cdots \ge p_t, \qquad (2.1)$$

which can be tested (Robertson, 1978). When (2.1) is assumed, maximum
likelihood estimates of the $p_i$'s are easily obtained using the "pool-
adjacent-violators" algorithm (Barlow, et al., 1972, pp. 13-18). If $\hat{p}_1$ and
$\hat{p}_2$ are the usual sample mean estimates of $p_1$ and $p_2$, it follows that
a maximum likelihood estimate of $\zeta$ is $\hat{\zeta} = \hat{p}_1 - \hat{p}_2$ if $\hat{p}_1 \ge \hat{p}_2$, and if $\hat{p}_1 < \hat{p}_2$,
the estimate is zero (Zehna, 1966).
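
The answer-until-correct model can be made concrete with a short Python sketch. It computes the attempt probabilities $p_1, \ldots, p_t$ from $\zeta$ and the $\zeta_j$, applies an equal-weight pool-adjacent-violators pass to force a non-increasing sequence, and forms the estimate of $\zeta$ described above. All function names are hypothetical, and the equal-weight version of the algorithm is an assumption made for illustration; Barlow et al. (1972) should be consulted for the procedure actually intended.

```python
from fractions import Fraction


def attempt_probabilities(zeta, zeta_j):
    """Return [p_1, ..., p_t]; zeta_j[j] is the proportion of items for
    which j distractors can be eliminated (j = 0, ..., t-2)."""
    t = len(zeta_j) + 1
    probs = []
    for i in range(1, t + 1):
        # With j distractors eliminated, t - j alternatives remain, so the
        # correct one can first appear on attempt i only when t - j >= i.
        last_j = min(t - i, t - 2)
        guess = sum(zeta_j[j] / (t - j) for j in range(last_j + 1))
        probs.append(guess + (zeta if i == 1 else 0))
    return probs


def pool_adjacent_violators(values):
    """Equal-weight fit that is non-increasing, as required by (2.1)."""
    blocks = []  # each block holds [mean, size]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while an earlier block is smaller than a later one.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            mean2, size2 = blocks.pop()
            mean1, size1 = blocks.pop()
            size = size1 + size2
            blocks.append([(mean1 * size1 + mean2 * size2) / size, size])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted


def estimate_zeta(p1_hat, p2_hat):
    """Estimate of zeta: p1_hat - p2_hat, truncated at zero."""
    return max(p1_hat - p2_hat, 0)


if __name__ == "__main__":
    p = attempt_probabilities(Fraction(1, 2),
                              [Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)])
    print(p, estimate_zeta(p[0], p[1]))
```

Exact fractions are used in the example so that the identity $p_1 - p_2 = \zeta$ can be seen without rounding error.
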

In addition to correcting for partial information, the model can
solve several other measurement problems (Wilcox, 1981b, 1980e). Suppose,
for example, we have an n-item test, and that X is the expected number of
items for which we correctly determine whether the typical examinee knows
the correct response. Using results in the engineering literature on
"system reliability" it is possible to make inferences about whether X is
large or small (Wilcox, 1980e).

Before concluding this section we make the important observation that
latent structure models based on the notion of equivalent items have been
successfully applied to real data sets (Macready & Dayton, 1977; Harris & Pearlman,
1978). More recently, Professor C. W. Harris and his colleagues made extensive
use of these models to measure the arithmetic achievement of students in
various grade levels. Examinees were tested every week, over a period of
many weeks. All indications are that the models are indeed useful.

Finally, Molenaar writes that the estimates of $\zeta$ and $\beta$ are unbiased
only if the selected pairs of equivalent items are representative of the
item pool. Actually, the estimates appear to be always biased whether we
have a random sample or not. However, we do get maximum likelihood
estimates as long as the estimates have an admissible value (Wilcox, 1977).

3. THE MULTINOMIAL MODEL

In the next section of Molenaar's paper, he turns his attention to
the multinomial model. Suppose an examinee responds to $n$ items, none of
which are equivalent. (A strong true score model for equivalent item
pairs is described in Wilcox, 1981.) Still considering only a single
examinee, let $y$ be the number of items he/she knows, and let $z$ be the
number of items not known but guessed correctly. My paper (Wilcox, 1979a)
considers a bivariate analog of the binomial error model (Keats and Lord,
1962; Lord, 1965; Lord and Novick, 1968, chapter 23). In particular, I
assume that the joint density of $y$ and $z$ is

$$f(y, z \mid \zeta, \gamma) = \frac{n!}{y!\,z!\,(n-y-z)!}\,\zeta^{y}\,\gamma^{z}\,(1-\zeta-\gamma)^{n-y-z}, \qquad (3.1)$$

where $n$ is the number of items on the test. Ordinarily we cannot make
inferences about $\zeta$ and $\gamma$, but as already indicated, we can make
inferences about them when equivalent or hierarchically related items are
available, or when an answer-until-correct scoring procedure is used.
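
A minimal Python sketch of the trinomial probability function in (3.1) is given below. The function name is hypothetical. The example also illustrates why $\zeta$ and $\gamma$ cannot ordinarily be separated: only the number correct, $y + z$, is observed, and its distribution depends on the parameters only through $\zeta + \gamma$.

```python
from math import factorial


def joint_pmf(y: int, z: int, n: int, zeta: float, gamma: float) -> float:
    """Probability of knowing y items and correctly guessing z others
    on an n-item test, following the trinomial form of (3.1)."""
    if y < 0 or z < 0 or y + z > n:
        return 0.0
    rest = n - y - z
    coef = factorial(n) // (factorial(y) * factorial(z) * factorial(rest))
    return coef * zeta ** y * gamma ** z * (1.0 - zeta - gamma) ** rest


def number_correct_pmf(x: int, n: int, zeta: float, gamma: float) -> float:
    """Marginal probability of x correct responses, summing over all
    splits of x into known (y) and guessed (z) items."""
    return sum(joint_pmf(y, x - y, n, zeta, gamma) for y in range(x + 1))


if __name__ == "__main__":
    print(number_correct_pmf(4, 5, zeta=0.6, gamma=0.1))
```
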

Of course we can assume guessing is random (see in particular,
Morrison and Brockway, 1979), but I have already described the problems
with this. However, we can empirically test whether guessing is at
random (Weitzman, 1970; Wilcox, 1981b). When an answer-until-correct
scoring procedure is used, this corresponds to testing whether $p_2 = p_3 = \cdots = p_t$.
If guessing is not at random, perhaps infrequently chosen distractors could
be modified or replaced so that this assumption is more realistic. In
this case, results in Morrison and Brockway (1979), and Molenaar (1977),
might be applied. Wilcox (in press a) gives some results that might be
useful in identifying those distractors that are infrequently chosen.
Note that we can also measure how far away guessing is from being random
(Wilcox, 1981b), and we can empirically determine how many distractors
are needed when testing a particular population of examinees (Wilcox,
1980e).



4. THE DIRICHLET PRIOR

In Wilcox (1979a), I assume that $\zeta$ and $\gamma$ have a bivariate Dirichlet
density given by

$$g(\zeta, \gamma) = \frac{\Gamma(\nu_1+\nu_2+\nu_3)}{\Gamma(\nu_1)\,\Gamma(\nu_2)\,\Gamma(\nu_3)}\,\zeta^{\nu_1-1}\,\gamma^{\nu_2-1}\,(1-\zeta-\gamma)^{\nu_3-1}. \qquad (4.1)$$

If we can estimate $\zeta$ and $\gamma$ for $N$ randomly sampled examinees, we can
estimate the $\nu_i$'s. I used equivalent items in Wilcox (1979a) to do this, but
as already noted, two other approaches are now available which do not
assume random guessing.
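
For readers who want to evaluate the density in (4.1) numerically, a minimal Python sketch follows. The function name is hypothetical; the log-gamma function is used only to avoid overflow for large hyperparameters.

```python
from math import exp, lgamma, log


def dirichlet_density(zeta: float, gamma: float,
                      nu1: float, nu2: float, nu3: float) -> float:
    """Bivariate Dirichlet density of (zeta, gamma) as written in (4.1).

    Returns 0 outside the open region zeta > 0, gamma > 0, zeta + gamma < 1.
    """
    rest = 1.0 - zeta - gamma
    if zeta <= 0.0 or gamma <= 0.0 or rest <= 0.0:
        return 0.0
    log_norm = (lgamma(nu1 + nu2 + nu3)
                - lgamma(nu1) - lgamma(nu2) - lgamma(nu3))
    log_kernel = ((nu1 - 1.0) * log(zeta)
                  + (nu2 - 1.0) * log(gamma)
                  + (nu3 - 1.0) * log(rest))
    return exp(log_norm + log_kernel)


if __name__ == "__main__":
    print(dirichlet_density(0.5, 0.25, nu1=2.0, nu2=1.0, nu3=1.0))
```
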

Let $\hat{\zeta}_i$ and $\hat{\beta}_i$ be the maximum likelihood estimates of $\zeta$ and $\beta$, respectively,
for the $i$th randomly sampled examinee ($i = 1, \ldots, N$). Molenaar raises the
interesting question of whether we can improve upon $\hat{\zeta}_i$ and $\hat{\beta}_i$ by shrinking
their values toward each other. Molenaar alludes to the possibility of
using Kelley's regression estimate of true score. If "better" estimates
of $\zeta$ and $\beta$ are available, we might be able to get improved estimates of
the hyperparameters $\nu_1$, $\nu_2$ and $\nu_3$. If an ensemble squared error loss
function is believed to be appropriate when estimating $\zeta$ (and $\beta$), there is
reason to hope that such a procedure might improve upon the maximum
likelihood estimate of $\zeta$ (and $\beta$) used in my paper (e.g., Efron and
Morris, 1973; Wilcox, 1978a). Griffin and Krutchkoff (1971) show that
Kelley's regression estimate of a parameter is the optimal linear estimate
under squared error loss if we start with an unbiased estimate
of the parameter. However, the estimate of $\zeta$ (and $\beta$) is biased,
and so it is not clear whether Kelley's regression equation will help
improve my estimates of $\zeta$ and $\beta$. We might use Kelley's regression estimate
anyway, but the efficacy of this needs to be checked. For an alternative
way of possibly improving the estimation of $\zeta$ (and $\beta$), see Wilcox (1980c).
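
Kelley's regression estimate, in its usual textbook form, pulls an observed score toward the group mean in proportion to the unreliability of the score. The Python sketch below uses that standard form; the function and parameter names are hypothetical, and nothing here is meant to suggest how the estimate would perform for the biased estimates discussed above.

```python
def kelley_estimate(observed: float, group_mean: float,
                    reliability: float) -> float:
    """Kelley's regression estimate of true score.

    estimate = reliability * observed + (1 - reliability) * group_mean
    A reliability of 1 leaves the observed score unchanged; a reliability
    of 0 replaces it with the group mean.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return reliability * observed + (1.0 - reliability) * group_mean


if __name__ == "__main__":
    print(kelley_estimate(observed=0.9, group_mean=0.6, reliability=0.5))
```
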

Molenaar also implies that using Kelley's regression estimate of
$\zeta$ and $\beta$ might also improve the estimates of the $\nu_i$'s. There is,
unfortunately, no evidence that this is ever the case. In an unpublished
report, I tried a similar tactic in a situation where unbiased estimates of
a parameter were available, but the results were not overly convincing.

Next Molenaar comments on the numerical example in my paper. To
estimate the $\nu_i$'s, I used artificially generated data on 1,000 examinees
taking 100 pairs of equivalent items. Molenaar inferred that a large
number of items and examinees are needed to get reasonably accurate
estimates. It should be pointed out, however, that the number of examinees
and items was completely arbitrary. Just how accurate an estimate of the
$\nu_i$'s we get with a smaller number of items is unknown. We would, of
course, expect the accuracy of the estimates to depend on the actual
values of $\zeta$ and $\beta$ (cf. Wilcox, 1980a). From Wilcox (1979b) we would
also expect to find instances where a moderate number of examinees would
give wildly inaccurate results. Such situations might be rare, but this
has not been established. The main point is that currently there is no
information on how many items should be used when applying the model.

i 

Note that for reasons given by Mosimann (1962) a slight modification
of the estimates of the $\nu_i$'s used in Wilcox (1979a) might be desirable.
Mosimann (1962) describes the procedure, and Wilcox (1981a) indicates how
to apply it to the case where we have pairs of equivalent items. Any
future investigations on estimating the $\nu_i$'s should include this procedure.

Molenaar also raises the important issue that the binomial error
model (and consequently the multinomial model) implies that all items have
the same level of difficulty. From a theoretical point of view, this
restriction of the model is unacceptable. A simple way to eliminate this
problem is to use an approximation to the compound binomial distribution
(Lord, 1965). However, for many purposes, this seems to be unnecessary
(Lord, 1965; Algina and Noe, 1978; Wilcox, 1977b, 1978a). Also the beta-
binomial model has given good results in other empirical investigations
(Gross and Shulman, 1980; Subkoviak, 1978a). Since the beta-binomial
model appears to be both useful and robust in certain respects, there is
hope that the Dirichlet-multinomial will share the same properties since
it is the multivariate analog of the beta-binomial model. Some evidence
for this is given in Wilcox (in press b) where the Dirichlet-multinomial
model was applied to real data, but more work needs to be done. For
further discussions of the binomial and beta-binomial models see Wilcox
(1981).

Molenaar suggests that to accommodate unequal item difficulties, we
might use the Rasch model. (For a review of latent trait models, see
Hambleton et al., 1978; and for a review of some recent developments on
the Rasch model, see Wainer et al., 1980.) However, this model does not
yield an estimate of $\zeta$, at least not in any way that has been demonstrated,
and it ignores the problem of guessing. Some latent trait models, but
not the Rasch model, have what is sometimes called a guessing parameter.
This is just the lower asymptote of the item characteristic curve. Note,
however, that this is different from the notion of Type I and Type II
guessing. Thus, the Rasch model is unable to solve any of the measurement
problems described in Wilcox (1981b, 1980e). No claim is being made that
latent trait models are useless, nor do I believe that latent trait and
latent structure models are in competition with one another; the point is
that they answer different questions. For further critical remarks
regarding the Rasch model, see Lord (1974).

5. MODEL ADEQUACY

Molenaar objects to the implication of the Dirichlet-multinomial model
that $\zeta$ and $\beta$ are independent over the population of examinees, and from a
theoretical point of view, he is, of course, correct. As Molenaar puts it,
"One wonders whether a person who knows many items from the domain will
also be more clever in guessing the remaining ones, if only by the 'warm
glow of success'?" The first point is that if we throw out the model
because $\zeta$ and $\beta$ are independent, we must throw out the random guessing
model as well since $\zeta$ and $\beta$ are again independent. The second point is
that when addressing a particular measurement problem the seriousness of
assuming $\zeta$ and $\beta$ to be independent is not known.

To allow $\zeta$ and $\beta$ to be correlated, there appear to be three possibilities.
The first is to replace (3.1) with

[The displayed density $g(\zeta, \gamma)$ is illegible in the source scan. It is a
sum over $j = 0, 1, \ldots$ whose terms involve the constants $c_j$ and the
factor $\Gamma(\nu_1+\nu_2+\nu_3+j)$.]   (4.1)

where $c_j$ is a constant depending on $j$, but not $\tau$, $\tau$ is an unknown parameter,
and $\Psi$ is a function of $\tau$ (Wilcox, 1981a). The density (4.1) contains (3.1)
as a special case. Moreover, if $\zeta$ and $\beta$ are assumed to be continuous,
they are independent if and only if (4.1) reduces to (3.1). One choice
for $c_j$ and $\Psi$ is $c_j = (j!)^{-1}$ and $\Psi(\tau) = e^{\tau}$, in which case the marginal density
of $\zeta$ belongs to the non-central beta family. Assuming (4.1) holds, let
$r = \nu_1$, $s = \nu_2 + \nu_3$, and let $E_y$ mean expectation with respect to the probability
function

$$f(y) = \frac{c_y\,\tau^{y}}{\Psi(\tau)} \qquad (y = 0, 1, \ldots).$$

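The first four moments of the marginal density of ζ are

$$\begin{aligned}
\mu_1 &= 1 - s\,E_y\!\left[\frac{1}{r+s+y}\right], \\
\mu_2 &= s - (s-1)\mu_1 - (s+1)s\,E_y\!\left[\frac{1}{r+s+1+y}\right], \\
\mu_3 &= \tfrac{1}{2}\left\{2 - s(s-1)(s-2)\,E_y\!\left[\frac{1}{r+s+y}\right] + 2s(s-1)(s+1)\,E_y\!\left[\frac{1}{r+s+1+y}\right] - s(s+1)(s+2)\,E_y\!\left[\frac{1}{r+s+2+y}\right]\right\},
\end{aligned} \tag{4.2}$$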

where 

d 1 ■ %{r4E(y)-2s-s(s-l)(s+l)Ey f ^ +l4y ] \ 
+ s(s+1)(s+2)f y(w] " d 2 } > ' 



v 



E(y) = t4* (t)/*(t), and ^ 

♦ 

d, - r * E(y)-2(s+l)+(s+l)(s+2) e [ r+g | y+2 ] 



Note that there is no need to evaluate E(y) when calculating d_1 since E(y)
cancels out.

It can be seen that E_y(t^y)=Ψ(τt)/Ψ(τ) and so

$$E_y\!\left[\frac{1}{r+s+k+y}\right] = \int_0^1 t^{\,r+s+k-1}\,\frac{\Psi(\tau t)}{\Psi(\tau)}\,dt$$

for any integer k ≥ 0 (Chao and Strawderman, 1972). The integral in this
last expression can be evaluated with IMSL (1975) subroutine DECADRE.
Thus, the method of moments might be used to estimate the parameters in (4.1).
It should be stressed, however, that the practical advantages of using
(4.1) are not known.

The second approach to allowing ζ and β to be correlated is to follow
the suggestion of Aitchison and Shen (1980) and replace (3.1) with a logistic
normal distribution. However, the moments are not reducible to any simple
form, which makes this approach impractical for the problem at hand. For
alternative generalizations of (3.1), see the papers cited in Wilcox (1979a).

The third approach is to apply the Dirichlet-multinomial model to an
answer-until-correct scoring procedure. This, and other models, are now being
tried out on some real data. The results should be available in the near
future.

6. CONCLUDING REMARKS 

The goal in Wilcox (1979a) was to suggest a strong true-score model
that allows guessing to vary over a population of examinees. Another
motivation for the model was that there are real situations where
equivalent items are assumed (e.g., Wilcox, in press b), but previously there
was no strong true-score model for handling this case.

Molenaar has raised some important concerns about whether the problem
of guessing has been satisfactorily dealt with. Considerable progress
has been made since my paper was published, but I still agree with him
that more work needs to be done. The important point of this paper is
that today we have several methods for dealing with guessing without
assuming it is at random. Moreover, each solution can be empirically
checked in several ways. Early attempts at correcting for guessing were
based on rather restrictive assumptions, but there seem to be situations
where these assumptions are appropriate. More recent solutions are based
on weaker assumptions, but we need more experience with them before they
are routinely applied. As previously indicated, an empirical investigation
of an answer-until-correct scoring procedure is currently underway which
should partially correct this problem.



REFERENCES

Aitchison, J., & Shen, S. M. (1980). Logistic-normal distributions: Some
    properties and uses. Biometrika, 67, 261-272.

Algina, J., & Noe, M. J. (1978). A study of the accuracy of Subkoviak's
    single-administration estimate of the coefficient of agreement using
    two true-score estimates. Journal of Educational Measurement, 15,
    101-110.

Baker, F. B., & Hubert, L. J. (1977). Inference procedures for ordering
    theory. Journal of Educational Statistics, 2, 217-233.

Barlow, R., Bartholomew, D., Bremner, J., & Brunk, H. (1972). Statistical
    inference under order restrictions. New York: Wiley.

Bliss, L. B. (1980). A test of Lord's assumption regarding examinee
    guessing behavior on multiple-choice tests using elementary school
    students. Journal of Educational Measurement, 17, 147-153.

Brown, J. (1965). Multiple response evaluation of discrimination. The
    British Journal of Mathematical and Statistical Psychology, 18, 125-137.

Chao, M. T., & Strawderman, W. E. (1972). Negative moments of positive
    random variables. Journal of the American Statistical Association,
    67, 429-431.

Cross, L. H., & Frary, R. B. (1977). An empirical test of Lord's
    theoretical results regarding formula-scoring of multiple-choice
    tests. Journal of Educational Measurement, 14, 313-321.

Dayton, C. M., & Macready, G. B. (1976). A probabilistic model for
    validation of behavioral hierarchies. Psychometrika, 41, 189-204.

Efron, B., & Morris, C. (1973). Stein's estimation rule and its competitors.
    Journal of the American Statistical Association, 68, 117-130.

Griffin, B. S., & Krutchkoff, R. G. (1971). Optimal linear estimators:
    An empirical Bayes version with application to the binomial
    distribution. Biometrika, 58, 195-201.

Gross, A. L., & Shulman, V. (1980). The applicability of the beta-
    binomial model for criterion-referenced testing. Journal of Educational
    Measurement, 17, 195-202.

Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., & Gifford,
    J. A. (1978). Developments in latent trait theory: Models, technical
    issues, and applications. Review of Educational Research, 48,
    467-510.

Hartke, A. R. (1978). The use of latent partition analysis to identify
    homogeneity of an item population. Journal of Educational Measurement,
    15, 43-47.

Harris, C. W., & Pearlman, A. (1978). An index for a domain of completion
    or short answer items. Journal of Educational Statistics, 3,
    285-304.

Horst, P. (1933). The difficulty of a multiple-choice test item. Journal
    of Educational Psychology, 24, 229-232.

IMSL Library 1. (1975). Volume II. Houston: International Mathematical
    and Statistical Libraries.

Keats, J. A., & Lord, F. M. (1962). A theoretical distribution for
    mental test scores. Psychometrika, 27, 59-72.

Lord, F. M. (1965). A strong true-score theory, with applications.
    Psychometrika, 30, 239-270.

Lord, F. M. (1974). Individualized testing and item characteristic
    curve theory. In D. H. Krantz, R. C. Atkinson, R. D. Luce, &
    P. Suppes (Eds.), Contemporary developments in mathematical psychology,
    Volume II. San Francisco: Freeman.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental
    test scores. Reading, Mass.: Addison-Wesley.

Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic models
    in the assessment of mastery. Journal of Educational Statistics,
    2, 99-120.

Molenaar, I. (1977). On Bayesian formula scores for random guessing in
    multiple choice tests. British Journal of Mathematical and Statistical
    Psychology, 30, 79-89.

Molenaar, I. (1981). On Wilcox's latent structure model for guessing.
    British Journal of Mathematical and Statistical Psychology, 34.

Morrison, D. G., & Brockway, G. (1979). A modified beta-binomial model
    with applications to multiple choice and taste tests. Psychometrika,
    44, 427-442.

Mosimann, J. E. (1962). On the compound multinomial distribution, the
    multivariate β-distribution, and correlations among proportions.
    Biometrika, 49, 65-82.

Robertson, T. (1978). Testing for and against an order restriction on
    multinomial parameters. Journal of the American Statistical Association,
    73, 197-202.

Subkoviak, M. J. (1978). Empirical investigation of procedures for
    estimating reliability for mastery tests. Journal of Educational
    Measurement, 15, 111-116.

Wainer, H., Morgan, A., & Gustafsson, J. (1980). A review of estimation
    procedures for the Rasch model with an eye toward longish tests.
    Journal of Educational Statistics, 5, 35-64.

Weitzman, R. A. (1970). Ideal multiple-choice items. Journal of the
    American Statistical Association, 65, 71-89.

White, R. T., & Clark, R. M. (1973). A test of inclusion which allows
    for errors of measurement. Psychometrika, 38, 77-86.

Wilcox, R. R. (1977a). New methods for studying stability. In C. W.
    Harris, A. Pearlman, & R. Wilcox, Achievement test items: Methods
    of study. CSE Monograph No. 6. Los Angeles: Center for the Study
    of Evaluation, University of California.

Wilcox, R. R. (1977b). Estimating the likelihood of a false-positive
    and false-negative decision with a mastery test: An empirical Bayes
    approach. Journal of Educational Statistics, 2, 289-307.

Wilcox, R. R. (1978). Estimating true score in the compound binomial
    error model. Psychometrika, 43, 245-258.

Wilcox, R. R. (1979a). Achievement tests and latent structure models.
    British Journal of Mathematical and Statistical Psychology, 32,
    61-71.

Wilcox, R. R. (1979b). Estimating the parameters of the beta-binomial
    distribution. Educational and Psychological Measurement, 39, 527-535.

Wilcox, R. R. (1980a). Some results and comments on using latent
    structure models to measure achievement. Educational and Psychological
    Measurement, 40, 645-658.

Wilcox, R. R. (1980b). An approach to measuring the achievement or
    proficiency of an examinee. Applied Psychological Measurement,
    4, 241-251.

Wilcox, R. R. (1980c). Determining the length of a criterion-referenced
    test. Applied Psychological Measurement, to appear.

Wilcox, R. R. (1980d). Solving measurement problems with an answer-until-
    correct scoring procedure. Center for the Study of Evaluation,
    University of California, Los Angeles.

Wilcox, R. R. (1980e). Using results on k out of n system reliability
    to study and characterize tests. Center for the Study of Evaluation,
    University of California, Los Angeles.

Wilcox, R. R. (1981). A review of the beta-binomial model and its
    extensions. Journal of Educational Statistics, to appear.

Wilcox, R. R. (in press, a). Analyzing the distractors of multiple-choice
    test items or partitioning multinomial cell probabilities with respect
    to a standard. Educational and Psychological Measurement.

Wilcox, R. R. (in press, b). The single administration estimate of the
    proportion of agreement of a proficiency test scored with a latent
    structure model. Educational and Psychological Measurement.

Zehna, P. W. (1966). Invariance of maximum likelihood estimation.
    Annals of Mathematical Statistics, 37, 744.



An Extension of the Dirichlet-Multinomial
Model that Allows True Score and
Guessing to be Correlated

Rand R. Wilcox

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California • Los Angeles



Abstract

Most strong true-score models assume that when an examinee does
not know the correct response to a test item, the probability of guessing,
say β, is independent of an examinee's true score. In fact, it is common
practice to make the more restrictive assumption that β is the same known
constant for every examinee. One exception is the Dirichlet-multinomial
model, but true score and guessing are still assumed to be independent.
This paper describes an extension of the Dirichlet-multinomial model that
allows true score and guessing to be correlated.



Consider a multiple-choice test item designed to determine whether
an examinee has acquired a particular skill. An obvious problem is that
an examinee can give a correct response without knowing the answer; yet,
in many situations, it is economically infeasible to use completion items
in an attempt to correct this difficulty. On the other hand, guessing can
have serious implications for certain types of achievement tests (e.g.,
Wilcox, 1980a, 1980b). Thus, it is natural to search for scoring
procedures and probability models that take guessing into account.

Suppose a multiple-choice test item has t alternatives consisting
of t-1 distractors and one correct response. Typically, the problem of
guessing is handled by assuming that β=Pr(correct response | examinee
does not know) = 1/t, i.e., guessing is at random (e.g., Hamilton, 1950;
Chernoff, 1962; Duncan, 1974; Morrison and Brockway, 1979). There are
at least two serious objections to this approach. First, it is
unrealistic to assume that every examinee has the same probability of guessing.
For example, some examinees might be able to eliminate one or more
distractors from consideration without knowing the correct response. In
this case we would expect to have β>1/t. We might assume β=1/t but in
some instances this does not yield satisfactory results (Wilcox, 1980b).
The second objection to setting β=1/t is the implication that true score
and guessing are independent. As argued by Frary (1969), we would expect
this assumption to be false. Of course, if we set β=0, we still have this
problem.

Let ζ be the proportion of skills among a domain of skills that an
examinee has acquired and set γ=(1-ζ)β. Wilcox (1979) proposed a solution
to the first problem by assuming that, over the population of examinees,
ζ and γ have a bivariate Dirichlet distribution given by

$$g(\zeta,\gamma) = \frac{\Gamma(\nu_1+\nu_2+\nu_3)}{\Gamma(\nu_1)\,\Gamma(\nu_2)\,\Gamma(\nu_3)}\,\zeta^{\nu_1-1}\gamma^{\nu_2-1}(1-\zeta-\gamma)^{\nu_3-1} \tag{1.0}$$

where ν_i>0, i=1,2,3 are unknown parameters and Γ is the gamma function.
It was also assumed that the probability of x correct responses for an
examinee taking an n-item test is

$$\binom{n}{x}\theta^{x}(1-\theta)^{n-x} \tag{1.1}$$

where θ=ζ+γ is the examinee's percent correct true score. However, the
second problem remains since (1.0) implies that ζ and β are independent.

The restrictive nature of (1.0) has also concerned statisticians (e.g.,
James, 1975; Connor and Mosimann, 1969; and Antelman, 1972) but the
proposed generalizations of the Dirichlet distribution have proven to be
less than satisfactory. The purpose of this paper is to describe a broad
class of distributions that contains (1.0) as a special case, and which
allows ζ and β to be correlated. Our general results are illustrated for
the special case where the marginal distribution of ζ is non-central beta.
Before continuing, however, it is convenient to examine various extensions
of the beta distribution.

2. A Generalization of the Beta Distribution

In this section we describe a family of probability density functions
where (1.2) is "mixed" by a distribution that belongs to a large
class of discrete probability functions. We then indicate how our results
might be applied when a particular generalization of the beta density is
used to approximate g(θ), the distribution of θ over the population of
examinees.

Consider a random variable Y having the probability function

$$P(Y=y) = \frac{c_y \tau^y}{\Psi(\tau)}, \qquad y=0,1,\ldots, \tag{2.1}$$

where τ is an unknown parameter, c_y is a constant depending on y but not
τ, and Ψ is a function of τ. Expression (2.1) is referred to as a power
series distribution by Noack (1950). There are even more general discrete
distributions that contain the power series distribution as a special case
(e.g., Patil, 1962; Gupta, 1974), but they are not discussed here since
(2.1) is of sufficient generality for present purposes.

Consider

$$g(\theta) = \sum_{j=0}^{\infty} \frac{c_j \tau^j}{\Psi(\tau)}\,\frac{\Gamma(r+s+j)}{\Gamma(r+j)\,\Gamma(s)}\,\theta^{r+j-1}(1-\theta)^{s-1}. \tag{2.2}$$

It is readily verified that (2.2) has the properties of a probability
density function, i.e., it is non-negative and it integrates to one. If,
for example, we set c_j=(j!)^{-1} and Ψ(τ)=e^τ we get the non-central beta
distribution (see, e.g., Seber, 1963), and, if in addition τ=0, (2.2)
reduces to (1.2). The first three moments of (2.2) are, respectively,

$$\mu_1 = 1 - s\,E_y\!\left[\frac{1}{r+s+y}\right] \tag{2.3}$$

$$\mu_2 = s - (s-1)\mu_1 - (s+1)s\,E_y\!\left[\frac{1}{r+s+1+y}\right] \tag{2.4}$$

$$\mu_3 = \tfrac{1}{2}\left\{2 - s(s-1)(s-2)\,E_y\!\left[\frac{1}{r+s+y}\right] + 2s(s-1)(s+1)\,E_y\!\left[\frac{1}{r+s+1+y}\right] - s(s+1)(s+2)\,E_y\!\left[\frac{1}{r+s+2+y}\right]\right\} \tag{2.5}$$

where E_y denotes expectation with respect to the density given in expression
(2.1). It can be seen that E_y(t^y)=Ψ(τt)/Ψ(τ) and so from Chao and Strawderman
(1972) it follows that

$$E_y\!\left[\frac{1}{r+s+k+y}\right] = \int_0^1 u^{\,r+s+k-1}\,\frac{\Psi(\tau u)}{\Psi(\tau)}\,du \tag{2.6}$$

for any integer k ≥ 0. For a detailed derivation of these moments, see
Wilcox (1980c).
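
The identity behind (2.6) can be checked numerically. The sketch below takes the Poisson mixing distribution as an assumed example, for which E(t^Y) = exp(λ(t-1)), and compares a direct series for E[1/(a+Y)] with the integral of t^(a-1) E(t^Y) over (0, 1). The function names, the value a = 5.5 (standing in for r+s+k), and the use of Simpson's rule in place of IMSL subroutine DECADRE are illustrative choices and are not part of the original analysis.

```python
import math


def poisson_pgf(t, lam):
    """E[t**Y] for Y ~ Poisson(lam), i.e. Psi(lam*t)/Psi(lam) with Psi(tau) = exp(tau)."""
    return math.exp(lam * (t - 1.0))


def neg_moment_series(a, lam, terms=200):
    """E[1/(a+Y)] summed directly over the Poisson probabilities."""
    total = 0.0
    p = math.exp(-lam)            # P(Y = 0)
    for j in range(terms):
        total += p / (a + j)
        p *= lam / (j + 1)        # P(Y = j+1) obtained from P(Y = j)
    return total


def neg_moment_integral(a, lam, panels=2000):
    """Same quantity from the integral of t**(a-1) * E[t**Y] on (0, 1).

    Composite Simpson's rule; assumes a >= 1 so the integrand is bounded at 0
    and that `panels` is even.
    """
    h = 1.0 / panels

    def integrand(t):
        return t ** (a - 1.0) * poisson_pgf(t, lam)

    total = integrand(0.0) + integrand(1.0)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0


a = 5.5       # hypothetical value of r + s + k
lam = 1.0     # hypothetical Poisson parameter
series_value = neg_moment_series(a, lam)
integral_value = neg_moment_integral(a, lam)
print(series_value, integral_value)
```

The two numbers should agree to many decimal places; for other power series distributions only `poisson_pgf` and the series weights would need to change.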

Again omitting the tedious algebra, it can also be shown that 
(2.7) y 4 = y 3 - | [yg-d,] 
where 

.d, = V[r+E(y)-2s-s(s-l)(s+l) e [ rfs j ]+y ] 

+ s(s + l)(s + 2) E y (— - d 2 ], 
E<y) = t¥'(t)/¥(t), and 

d 2 = r + E(y) - 2(s+l) + (s+l)(s+2) e [ r J ytf ] ^ 

+ < s+1 >( s+3) E ( r+sly+3 ) ' 

+ (^) 2 [ E (TfiW) ■ (5+2) E ("^^' ' TfiW") 3 

Note that there is no need to evaluate E(y) when calculating d_1 since E(y)
cancels out.






Some special cases. To illustrate the results given above, suppose

$$P(Y=y) = \frac{\Gamma(\delta+y)}{y!\,\Gamma(\delta)}\,\tau^{y}(1-\tau)^{\delta}, \qquad y=0,1,\ldots, \tag{2.8}$$

where δ>0 and 0<τ<1. In terms of (2.1) we get this distribution by setting
c_y=Γ(δ+y)/(y!Γ(δ)) and Ψ(τ)=(1-τ)^{-δ}. Thus, (2.6) becomes

$$E_y\!\left[\frac{1}{r+s+k+y}\right] = \int_0^1 u^{\,r+s+k-1}\left[\frac{1-\tau}{1-u\tau}\right]^{\delta} du.$$

Hence, from expressions (2.3) - (2.5) and (2.7) we have the first four
moments of g(θ).

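As another illustration, suppose we replace (2.8) with the hyper-Poisson
probability function (Bardwell & Crow, 1964) given by

$$P(Y=y) = \frac{\Gamma(\delta)}{\Gamma(\delta+y)}\,\frac{\tau^{y}}{{}_1F_1(1;\delta;\tau)}, \qquad y=0,1,\ldots,$$

where δ>0, τ>0 and ${}_1F_1(1;\delta;\tau) = 1 + \tau/\delta + \tau^2/[\delta(\delta+1)] + \cdots$ is a special
case of the confluent hypergeometric series. In this instance

$$g(\theta) = \sum_{j=0}^{\infty} \frac{\Gamma(\delta)\,\tau^{j}}{{}_1F_1(1;\delta;\tau)\,\Gamma(\delta+j)}\,\frac{\Gamma(r+s+j)}{\Gamma(r+j)\,\Gamma(s)}\,\theta^{r+j-1}(1-\theta)^{s-1}.$$

Setting Ψ(τ)=₁F₁(1;δ;τ) for fixed δ, the value of E_y[1/(r+s+k+y)] is given
by (2.6). Again we can determine the first four moments of g(θ) with
(2.3)-(2.5) and (2.7).

3. The Non-Central Beta Distribution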

Before describing a model that allows ζ and β to be correlated, it
is helpful to consider how the results of section 2 can be used to estimate
g(θ). We do this for the case where θ is assumed to have a non-central
beta distribution given by

$$g(\theta) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{\theta^{r+j-1}(1-\theta)^{s-1}}{B(r+j,\,s)} \tag{3.1}$$

where λ>0, r>0 and s>0 are parameters to be determined. As previously
noted, (3.1) is a special case of (2.2). The corresponding marginal
distribution of observed scores, assuming (1.1) holds, is

$$h(x) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{B(r+j+x,\ n+s-x)}{(n+1)\,B(r+j,\,s)\,B(x+1,\ n+1-x)}$$

where B(r, s) = [Γ(r) Γ(s)]/Γ(r+s). We also note that if y is the observed
score on a randomly parallel test having n_1 items, the joint distribution
of x and y is

$$h(x,y) = \int_0^1 \binom{n}{x}\theta^{x}(1-\theta)^{n-x}\binom{n_1}{y}\theta^{y}(1-\theta)^{n_1-y}\,g(\theta)\,d\theta$$

$$\phantom{h(x,y)} = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{B(r+j+x+y,\ n+n_1+s-x-y)}{(n+1)(n_1+1)\,B(r+j,\,s)\,B(1+x,\ n+1-x)\,B(1+y,\ n_1+1-y)}.$$

This last result might be useful in the single administration estimate of a
mastery test. (See Huynh, 1976.)

We will need a method of estimating the parameters λ, r and s using
the observed scores of a random sample of N examinees. The first step in
solving this problem is deriving expressions for the first three moments
of the non-central beta distribution. From the previous section we have
that



$$\mu_1 = 1 - s\,e^{-\lambda}\int_0^1 t^{\,r+s-1}e^{\lambda t}\,dt. \tag{3.2}$$

From Wishart (1932, p. 445) we see that

$$\int_0^1 t^{\,r+s-1}e^{\lambda t}\,dt = \frac{1}{r+s}\,F(r+s,\ r+s+1,\ \lambda) = \frac{e^{\lambda}}{r+s}\,F(1,\ r+s+1,\ -\lambda),$$

where F is the confluent hypergeometric series given by

$$F(a,b,c) = 1 + \frac{a}{b}\,c + \frac{a(a+1)}{b(b+1)}\,\frac{c^{2}}{2!} + \cdots$$

Hence, we have that

$$\mu_1 = 1 - \frac{s}{r+s}\,F(1,\ r+s+1,\ -\lambda).$$

Tables and computational procedures described by Abramowitz and Stegun
(1972, Chapter 13) can be used to evaluate F which in turn gives us the
value μ_1, or the value of μ_1 can be determined by evaluating the integral
in (3.2) with IMSL (1975) subroutine DECADRE. Note that for λ=0 (the
beta distribution) expression (3.2) reduces to r/(r+s) as it should.
The second moment about the origin is

$$\mu_2 = s - (s-1)\mu_1 - (s+1)s\,e^{-\lambda}\int_0^1 t^{\,r+s}e^{\lambda t}\,dt. \tag{3.3}$$

Finally, the third moment is

$$\mu_3 = \tfrac{1}{2}\left[2 - s(s-1)(s-2)\,E\!\left(\frac{1}{r+s+y}\right) + 2s(s-1)(s+1)\,E\!\left(\frac{1}{r+s+1+y}\right) - s(s+1)(s+2)\,E\!\left(\frac{1}{r+s+2+y}\right)\right] \tag{3.4}$$

where the expectations are taken with respect to y, y
having a Poisson distribution with parameter λ. Again referring to
Chao and Strawderman (1972),

$$E\!\left(\frac{1}{r+s+k+y}\right) = e^{-\lambda}\int_0^1 t^{\,r+s+k-1}e^{\lambda t}\,dt = (r+s+k)^{-1}\,F(1,\ r+s+k+1,\ -\lambda), \qquad k=0,1,2.$$

As before the integral in this last expression can be evaluated with IMSL
subroutine DECADRE.

It is known (e.g., Lord and Novick, 1968, p. 521) that μ_k, the kth
moment about the origin of the true score distribution, is equal to

$$\mu_k = \frac{M_{[k]}}{n^{[k]}}, \qquad k=1,2,\ldots,n, \tag{3.5}$$

where

$$M_{[k]} = \sum_{x=0}^{n} x^{[k]}\,h(x)$$

is the kth factorial moment of the marginal distribution of observed
scores, and

$$x^{[k]} = x(x-1)\cdots(x-k+1).$$

Thus, we can use the observed scores of a random sample of N examinees
to estimate μ_k with say μ̂_k. Substituting μ̂_1, μ̂_2, μ̂_3 for μ_1, μ_2, μ_3,
respectively in equations (2.3), (2.4) and (2.5) and solving for r, s and
λ yields estimates of these parameters, say r̂, ŝ and λ̂.

At present, the solution to these equations is being obtained using
numerical analysis techniques. In particular, we used subroutine ZSYSTM
to solve μ_1 and μ_2 for r and s using a fixed value of λ. As initial
estimates of r and s we set λ=0, in which case explicit estimates of r and s
are available as indicated in the numerical illustration below. With the
initial estimates of r, s and λ we computed the corresponding value of
μ_3. If this value is not in close agreement with μ̂_3, we increased λ by
one, solved for r and s, and again computed the implied value of μ_3. We
repeated this process until values of λ, r and s were found that give a
good approximation to μ̂_3.

Numerical illustration. Suppose we have a 5-item test and that f_x
examinees received an observed score of x, the values of which are
summarized in Table 1.

                         Table 1

          Observed Frequencies on a 5-Item Test

          x:       0     1     2     3     4     5
          f_x:    23    19    33    15     6     4

The first three moments of the true score distribution were estimated to
be .652, .458 and .339 respectively.
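
The moment estimates quoted above can be reproduced from the relation μ_k = (kth factorial moment of observed scores) / [n(n-1)...(n-k+1)]. The sketch below is a minimal illustration in Python; the function name is hypothetical. One assumption should be stated loudly: the reported values .652, .458 and .339 are obtained only when the six frequencies 23, 19, 33, 15, 6, 4 are read against scores 5, 4, 3, 2, 1, 0 respectively, so the list in the code is ordered that way (index = score).

```python
def true_score_moments(freqs, n, k_max=3):
    """Estimate mu_k = M_[k] / n^[k] for k = 1..k_max.

    freqs[x] is the number of examinees with observed score x (x = 0..n);
    M_[k] is the sample kth factorial moment of the observed scores.
    """
    n_examinees = sum(freqs)
    estimates = []
    for k in range(1, k_max + 1):
        factorial_moment = 0.0
        for x, f in enumerate(freqs):
            falling = 1
            for i in range(k):
                falling *= (x - i)          # x(x-1)...(x-k+1)
            factorial_moment += falling * f
        factorial_moment /= n_examinees
        n_falling = 1
        for i in range(k):
            n_falling *= (n - i)            # n(n-1)...(n-k+1)
        estimates.append(factorial_moment / n_falling)
    return estimates


# ASSUMPTION: frequencies listed as 23, 19, 33, 15, 6, 4 are matched to
# scores 5, 4, 3, 2, 1, 0, which is the reading that reproduces the
# reported moments. Index of the list below is the observed score.
freqs_by_score = [4, 6, 15, 33, 19, 23]
mu_hat = true_score_moments(freqs_by_score, n=5)
print([round(m, 3) for m in mu_hat])
```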



Setting λ=0 and using the method of moments, we estimate r and s with



. •(y 1 ) 2 '(N£ l ) , 
r - — h. - 7z - p. T 

»»2- »q - 




s = 



(e.g., Huynh, 1976; Wilcox, 1977) yielding r=3.93 and s=2.04. From
expression (3.5), or from standard results on the beta distribution, these
values of r, s and λ imply that μ_3=.346, but as previously noted, the
estimate of μ_3 was .339. Therefore, we increased λ to 1 and solved (3.2) and
(3.3) for r and s with IMSL (1975) subroutine ZSYSTM yielding r=3.2876
and s=2.2149. From (3.4) it follows that μ_3=.33895. Thus, these values
of r, s and λ are in reasonably good agreement with the estimated values
of μ_1, μ_2 and μ_3. If (3.4) had yielded a number for μ_3 greater than .339,
we would have increased λ from 1 to 2 and repeated the process.
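
The fixed-λ step of this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expectations E[1/(r+s+k+y)] are summed directly over the Poisson probabilities instead of being evaluated with DECADRE, and the two moment equations are solved by a one-dimensional bisection on r+s (s follows from the first moment once r+s is fixed) rather than with ZSYSTM. The function names and the search bracket [1, 50] are hypothetical choices.

```python
import math


def neg_moments(a, lam, kmax=3, terms=200):
    """Return E[1/(a+k+Y)] for k = 0..kmax-1, with Y ~ Poisson(lam)."""
    out = [0.0] * kmax
    p = math.exp(-lam)                       # P(Y = 0)
    for j in range(terms):
        for k in range(kmax):
            out[k] += p / (a + k + j)
        p *= lam / (j + 1)                   # P(Y = j+1)
    return out


def moments(r, s, lam):
    """First three moments of the non-central beta, following (3.2)-(3.4)."""
    e0, e1, e2 = neg_moments(r + s, lam)
    mu1 = 1.0 - s * e0
    mu2 = s - (s - 1.0) * mu1 - (s + 1.0) * s * e1
    mu3 = 0.5 * (2.0
                 - s * (s - 1.0) * (s - 2.0) * e0
                 + 2.0 * s * (s - 1.0) * (s + 1.0) * e1
                 - s * (s + 1.0) * (s + 2.0) * e2)
    return mu1, mu2, mu3


def solve_r_s(m1, m2, lam, t_lo=1.0, t_hi=50.0, iters=80):
    """Match the first two moments for a fixed lam by bisection on T = r + s.

    ASSUMPTION: the implied second moment exceeds m2 at t_lo and falls
    below it at t_hi, so the bracket contains a solution.
    """
    def implied(total):
        s = (1.0 - m1) / neg_moments(total, lam, kmax=1)[0]   # from mu1
        return s, moments(total - s, s, lam)[1]

    for _ in range(iters):
        t_mid = 0.5 * (t_lo + t_hi)
        if implied(t_mid)[1] > m2:      # implied mu2 too large: increase T
            t_lo = t_mid
        else:
            t_hi = t_mid
    total = 0.5 * (t_lo + t_hi)
    s = implied(total)[0]
    return total - s, s


# Values reported in the numerical illustration
reported = moments(3.2876, 2.2149, 1.0)
# Re-solve for r and s with lam fixed at 1, then check the implied mu3
r_hat, s_hat = solve_r_s(0.652, 0.458, lam=1.0)
mu_check = moments(r_hat, s_hat, 1.0)
print(reported, (r_hat, s_hat), mu_check[2])
```

In use, a caller would wrap `solve_r_s` in a loop over λ = 0, 1, 2, ... and stop once the implied third moment is close enough to its estimate. Because r and s are sensitive to small changes in the second moment, the re-solved values agree with the reported ones only to about two significant figures.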

4. Extensions to the Dirichlet-Multinomial Model

In this final section, we use the results of the previous two sections
to extend the Dirichlet-multinomial model so as to allow ζ and β to be
correlated. First, a brief review of this model is in order.

Consider a single examinee responding to n dichotomously scored items
randomly sampled from some item pool. Let x be the examinee's number
correct score, y be the number of items the examinee knows and z be the
number of items for which the examinee does not know but guesses the correct
response. Let ζ be the proportion of items in the item domain that an
examinee knows and let β be the probability of guessing the correct
response given that the examinee does not know. It follows that y and z
have a multinomial probability function given by

$$\frac{n!}{y!\,z!\,(n-y-z)!}\,\zeta^{y}\gamma^{z}(1-\zeta-\gamma)^{n-y-z}$$

where γ=(1-ζ)β. As previously mentioned, Wilcox (1979) assumes that ζ
and γ have a bivariate Dirichlet distribution given by (1.0). The model
contains the beta-binomial model as a special case (when β=0) and so in
terms of applications, it has all of the appealing features of the beta-
binomial model that are described by Lord (1965). An added advantage is
that the model allows guessing to vary over the population of examinees.
In some cases latent structure models can be used to estimate ζ and β
for a specific examinee which in turn makes it possible to apply it to
real data. (See Wilcox, 1979, for further details.)

The form of the non-central beta distribution suggests a generalization
of (1.0). More specifically we consider replacing (1.0) with

$$g(\zeta,\gamma) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{\Gamma(\nu_1+\nu_2+\nu_3+j)}{\Gamma(\nu_1+j)\,\Gamma(\nu_2)\,\Gamma(\nu_3)}\,\zeta^{\nu_1+j-1}\gamma^{\nu_2-1}(1-\zeta-\gamma)^{\nu_3-1}. \tag{4.1}$$

It is readily verified that (4.1) is a probability density function. Note
that if λ=0, (4.1) reduces to the Dirichlet distribution and so we expect
it to give as good or better an approximation to the joint density of ζ
and γ.

From known results about (1.0) it follows that the marginal densities
of ζ and γ are non-central beta distributions given by

$$g_1(\zeta) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{\Gamma(\nu_1+\nu_2+\nu_3+j)}{\Gamma(\nu_1+j)\,\Gamma(\nu_2+\nu_3)}\,\zeta^{\nu_1+j-1}(1-\zeta)^{\nu_2+\nu_3-1} \tag{4.2}$$

and 



From results given by Ishii and Hayakawa (1960), it can be deduced that
the marginal distribution of y and z is

$$p(y,z) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{B(\nu_1+y+j,\ \nu_2+z,\ n+\nu_3-y-z)}{(n+2)(n+1)\,B(\nu_1+j,\ \nu_2,\ \nu_3)\,B(1+y,\ 1+z,\ n+1-y-z)} \tag{4.4}$$

where B(a, b, c) = [Γ(a) Γ(b) Γ(c)]/Γ(a+b+c).



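The density of x=y+z is

$$f(x) = \sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!}\,\frac{B(\nu_1+\nu_2+j+x,\ n+\nu_3-x)}{(n+1)\,B(\nu_1+\nu_2+j,\ \nu_3)\,B(1+x,\ n+1-x)}$$

and the joint distribution of x and ζ is

$$p(x,\zeta) = \binom{n}{x}\sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!\,B(\nu_1+j,\ \nu_2,\ \nu_3)} \sum_{w=0}^{x}\binom{x}{w}\,B(w+\nu_2,\ n-x+\nu_3)\,\zeta^{\,x-w+\nu_1+j-1}(1-\zeta)^{\,n-x+w+\nu_2+\nu_3-1}. \tag{4.5}$$

Finally, following Wilcox (1979),

$$E(\zeta \mid x) = [f(x)]^{-1}\binom{n}{x}\sum_{j=0}^{\infty} \frac{e^{-\lambda}\lambda^{j}}{j!\,B(\nu_1+j,\ \nu_2,\ \nu_3)} \sum_{w=0}^{x}\binom{x}{w}\,B(w+\nu_2,\ n-x+\nu_3)\,B(x-w+\nu_1+j+1,\ n-x+w+\nu_2+\nu_3). \tag{4.6}$$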
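
As a rough numerical companion to the density of x and to E(ζ | x), the sketch below evaluates both by truncating the Poisson sum. It assumes the forms that follow from mixing the Dirichlet-multinomial model over Poisson weights: f(x) is a Poisson mixture of beta-binomial probabilities with parameters (ν_1+ν_2+j, ν_3), and the numerator of E(ζ | x) is the integral of ζ p(x, ζ) over (0, 1). The function names, the truncation at 150 terms, and the parameter values in the usage lines (borrowed from the estimates ν_1=1.2231, ν_2=.7115, ν_3=.1279, λ=.5 with a hypothetical test length n=5) are illustrative only.

```python
import math


def log_beta(*args):
    """log B(a, b, ...) = sum(log Gamma(a_i)) - log Gamma(sum a_i)."""
    return sum(math.lgamma(a) for a in args) - math.lgamma(sum(args))


def poisson_weight(j, lam):
    """P(Y = j) for Y ~ Poisson(lam), computed on the log scale."""
    if lam == 0.0:
        return 1.0 if j == 0 else 0.0
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


def marginal_and_posterior_mean(n, v1, v2, v3, lam, terms=150):
    """Return (f, post): f[x] = Pr(X = x) and post[x] = E(zeta | x), x = 0..n."""
    f = [0.0] * (n + 1)
    num = [0.0] * (n + 1)      # integral of zeta * p(x, zeta) over (0, 1)
    for j in range(terms):
        wj = poisson_weight(j, lam)
        if wj == 0.0:
            continue
        lb_joint = log_beta(v1 + j, v2, v3)
        lb_theta = log_beta(v1 + v2 + j, v3)
        for x in range(n + 1):
            cnx = math.comb(n, x)
            f[x] += wj * cnx * math.exp(
                log_beta(v1 + v2 + j + x, n + v3 - x) - lb_theta)
            inner = 0.0
            for w in range(x + 1):
                inner += math.comb(x, w) * math.exp(
                    log_beta(w + v2, n - x + v3)
                    + log_beta(x - w + v1 + j + 1, n - x + w + v2 + v3)
                    - lb_joint)
            num[x] += wj * cnx * inner
    post = [num[x] / f[x] for x in range(n + 1)]
    return f, post


# Hypothetical 5-item test with the parameter estimates quoted in the text
f, post = marginal_and_posterior_mean(5, 1.2231, 0.7115, 0.1279, 0.5)
print([round(p, 4) for p in f])
print([round(p, 4) for p in post])
```

Two internal checks are available: the probabilities f(x) should sum to one, and averaging E(ζ | x) over f(x) should return the marginal mean of ζ, which is 1 - (ν_2+ν_3) E[1/(ν_1+ν_2+ν_3+Y)].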



The appealing feature of (4.1) is that unless it reduces to a Dirichlet
distribution, ζ and β are correlated if the distributions of ζ and γ are
assumed to be continuous. The proof of this statement follows from a
result given by Darroch and Ratcliff (1971). In particular, as a special
case of their theorem 2, if the probability density function of ζ and γ is
continuous, the independence of ζ and β implies that ζ and γ have a Dirichlet
distribution.

Numerical illustration. Data collected by the Maryland State Department
of Education are used to illustrate the modified Dirichlet-multinomial
model. In particular, we use the test results on students taking a
preliminary form of a proficiency test in mathematics. The test consisted of
thirty skills with three items per skill for a total of 90 items on the test.
We could use the information on all three items associated with each skill
to obtain an averaged estimate of ζ and β for each examinee, the average
being defined in the sense described by Harris and Pearlman (1978). However,
since we merely want to illustrate the calculations involved in
applying the model, we simply ignore the information on the third item.

For a specific examinee, we summarize the observed responses as shown
in Table 2 where a 1 designates a correct and a 0 an incorrect response.

                         Table 2

          Observed Frequencies for an Examinee

                              Item 2
                            1        0
          Item 1    1     x_11     x_10
                    0     x_01     x_00

For example, x_10 is the number of item pairs for which the examinee is
correct on the first item of the pair and incorrect on the second.
Following Wilcox (1977) we estimate ζ with

$$\hat{\zeta} = 1 - \frac{(x_{01}+x_{00})(x_{10}+x_{00})}{n\,x_{00}}.$$

If x_00=0 we set ζ̂ equal to x_11/n and if ζ̂<0 we estimate ζ to be zero. As
for β, we use

$$\hat{\beta} = \frac{x_{10}}{x_{10}+x_{00}}.$$

If x_10 + x_00 = 0 we set β̂ = .25. If β̂ > .5 we estimate β to be .5. We
note that here, β represents the probability of guessing the first item
in the item pair; the probability of guessing for the second item does not
enter into the calculations.
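
The estimation rules for a single examinee, including the boundary cases, can be collected in one small function. The sketch below is an illustration only. It assumes that n is the number of item pairs (the sum of the four cell counts) and that the estimate of ζ has the form 1 - (x_01+x_00)(x_10+x_00)/(n x_00), which is what a latent structure model with two equivalent items per skill gives; the function name is hypothetical.

```python
def estimate_zeta_beta(x11, x10, x01, x00):
    """Estimate (zeta, beta) for one examinee from a 2x2 table of item pairs.

    x11: both items correct; x10: first correct, second incorrect;
    x01: first incorrect, second correct; x00: both incorrect.
    ASSUMPTION: n is the number of item pairs, n = x11 + x10 + x01 + x00.
    """
    n = x11 + x10 + x01 + x00

    # Guessing: proportion correct on item 1 among pairs with item 2 incorrect
    if x10 + x00 == 0:
        beta = 0.25                       # default when no information
    else:
        beta = min(x10 / (x10 + x00), 0.5)  # estimates above .5 are set to .5

    # Proportion of skills known
    if x00 == 0:
        zeta = x11 / n
    else:
        zeta = 1.0 - (x01 + x00) * (x10 + x00) / (n * x00)
        zeta = max(zeta, 0.0)             # negative estimates are set to zero
    return zeta, beta


# Hypothetical examinee: 30 skills, two items per skill
zeta_hat, beta_hat = estimate_zeta_beta(18, 3, 3, 6)
print(zeta_hat, beta_hat)
```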






The values of ζ and β were estimated using the test results on 2,000
examinees randomly sampled from the total number of examinees available.
The 2,000 estimates were then used to compute the first three sample
moments of ζ which were found to be .652, .496 and .405, respectively.

Since the marginal distribution of ζ is non-central beta, we can use
the methods previously described to estimate ν_1, ν_2+ν_3 and λ where ν_2+ν_3
corresponds to the parameter s in section 2. The estimates are 1.2231,
.83942 and .5, respectively. Next we computed the first sample moment of
γ which was .1287. Since γ is assumed to have a non-central beta
distribution, it follows that the mean of γ is

$$\mu_\gamma = 1 - (\nu_1+\nu_3)\,e^{-\lambda}\int_0^1 t^{\,\nu_1+\nu_2+\nu_3-1}e^{\lambda t}\,dt.$$

Substituting .1287 for μ_γ, 1.223 for ν_1, 2.062 for ν_1+ν_2+ν_3 and .5 for λ
and solving for ν_3 yields ν_3=.1279. Thus, estimates of ν_1, ν_2, ν_3 and λ are
ν̂_1=1.2231, ν̂_2=.83942 - .1279=.7115, ν̂_3=.1279 and λ̂=.5.

An alternative extension. For a specific examinee and a randomly
chosen item, let α=Pr(incorrect response | examinee knows). We conclude
this section by indicating that, to a certain extent, the Dirichlet-
multinomial model can be extended to include the possibility of α>0. If
we allow α>0, an examinee's percent correct true score is θ=(1-α)ζ+β(1-ζ).
Let γ_1=β(1-ζ)-αζ in which case θ=ζ+γ_1. As long as β>α, we have that
0 ≤ ζ ≤ 1, 0 < γ_1 < 1 and 0 < ζ+γ_1 < 1. Thus, it is theoretically
permissible to assume ζ and γ_1 have a bivariate Dirichlet distribution, or more
generally, their joint distribution is given by (4.1). Moreover, the
parameters of the model can be estimated in essentially the same manner as
outlined above.

References

Abramowitz, M., & Stegun, I. A. (Eds.) Handbook of mathematical functions.
    National Bureau of Standards, Applied Mathematics Series, Washington,
    D.C.: U.S. Government Printing Office, 1972, 55.

Antelman, G. R. Interrelated Bernoulli processes. Journal of the American
    Statistical Association, 1972, 67, 831-841.

Bardwell, G. E., & Crow, E. L. A two-parameter family of hyper-Poisson
    distributions. Journal of the American Statistical Association, 1964,
    59, 133-141.

Chao, M. T., & Strawderman, W. E. Negative moments of positive random
    variables. Journal of the American Statistical Association, 1972, 67,
    429-431.

Chernoff, H. The scoring of multiple choice questionnaires. Annals of
    Mathematical Statistics, 1962, 33, 375-393.

Connor, R. J., & Mosimann, J. E. Concepts of independence for proportions
    with a generalization of the Dirichlet distribution. Journal of the
    American Statistical Association, 1969, 64, 194-206.

Darroch, J. N., & Ratcliff, D. A characterization of the Dirichlet
    distribution. Journal of the American Statistical Association, 1971,
    66, 641-643.

Duncan, George T. An empirical Bayes approach to scoring multiple-choice
    tests in the misinformation model. Journal of the American Statistical
    Association, 1974, 69, 50-57.

Frary, R. B. Elimination of the guessing component of multiple-choice
    test scores: Effect on reliability and validity. Educational and
    Psychological Measurement, 1969, 29, 665-680.

Gupta, R. C. Modified power series distribution and some of its applications.
    Sankhya, 1974, Series B, 36, 288-298.

Hamilton, C. H. Bias and error in multiple-choice tests. Psychometrika,
    1950, 15, 151-168.

Harris, C. W., & Pearlman, A. P. An index for a domain of completion or
    short answer items. Journal of Educational Statistics, 1978, 3,
    285-304.

Huynh, H. On the reliability of decisions in domain-referenced testing.
    Journal of Educational Measurement, 1976, 13, 253-264.



48 



17 



References Cont. 



IMSL Library 1, Volume II. Houston: International Mathematical and
Statistical Libraries, 1975.

Ishii, G., & Hayakawa, R. On the compound binomial distribution. Annals
of the Institute of Statistical Mathematics, 1960, 12, 69-80.

James, I. R. Multivariate distributions which have beta conditional
distributions. Journal of the American Statistical Association,
1975, 70, 681-684.

Lord, F. M. A strong true-score theory, with applications. Psychometrika,
1965, 30, 239-270.

Morrison, D. G., & Brockway, G. A modified beta-binomial model with
applications to multiple choice and taste tests. Psychometrika, 1979, 44,
427-442.

Noack, A. A class of random variables with discrete distributions. Annals
of Mathematical Statistics, 1950, 21, 127-132.

Patil, G. P. Certain properties of the generalized power series distribution.
Annals of the Institute of Statistical Mathematics, 1962, 14, 179-182.

Seber, G. A. The non-central chi-squared and beta distributions. Biometrika,
1963, 50, 542-544.

Wilcox, R. R. Estimating the likelihood of a false-positive and false-
negative decision with a mastery test: An empirical Bayes approach.
Journal of Educational Statistics, 1977, 2, 289-307.

Wilcox, R. R. Achievement tests and latent structure models. British Journal
of Mathematical and Statistical Psychology, 1979, 32, 61-71.

Wilcox, R. R. Determining the length of a criterion-referenced test. Applied
Psychological Measurement, 1980, to appear (b).

Wilcox, R. R. Toward better approximations of the true score distribution.
Center for the Study of Evaluation, University of California, Los
Angeles, 1980 (c).




of an 



Wishart, J. A note on the distribution of the correlation ratio. Biometrika,
1932, 24, 441-456.



UNIVERSITY OF CALIFORNIA, LOS ANGELES

BERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO •
SANTA BARBARA • SANTA CRUZ

CENTER FOR THE STUDY OF EVALUATION
UCLA GRADUATE SCHOOL OF EDUCATION
LOS ANGELES, CALIFORNIA 90024

March 28, 1980

W. Scott Gehman
Editor
Educational and Psychological Measurement
Box 6907, College Station
Durham, NC 27708

Dear Dr. Gehman:

Please consider the enclosed manuscript "An extension of the
Dirichlet-multinomial model that allows true score and guessing
to be correlated" for publication in EPM.

Thank you very much.

Sincerely,

Rand R. Wilcox
Senior Research Associate

RRW/kr
Enclosure



SOME EMPIRICAL AND THEORETICAL RESULTS
ON AN ANSWER-UNTIL-CORRECT
SCORING PROCEDURE

Rand R. Wilcox

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles

and the

DEPARTMENT OF PSYCHOLOGY
University of Southern California

The project presented or reported herein was performed pursuant to a grant
from the National Institute of Education, Department of Health, Education,
and Welfare. However, the opinions expressed herein do not necessarily
reflect the position or policy of the National Institute of Education, and
no official endorsement by the National Institute of Education should be
inferred.



ABSTRACT

Wilcox (1980a) proposed a model for an answer-until-correct scoring
procedure that solves various measurement problems. The purpose of this
paper is to empirically check an implication of the model, and to propose
and investigate some strong true-score models. One of the strong
true-score models assumes the probability of guessing the correct response
to an item is a strictly increasing function of an examinee's ability
level, and the model gives a reasonable fit to the data. The paper
illustrates that this new model is easily applied to situations where the
beta-binomial model is typically used. The other models, including the
Dirichlet-multinomial model, proved to be unsatisfactory. Finally,
potential difficulties with the new model are discussed, and possible
directions for future research are described.




1. INTRODUCTION 




Wilcox (1980a) proposed a model for an answer-until-correct scoring
procedure that solves various measurement problems. In particular, it
can be used to test whether guessing is at random, to measure how "far away"
guessing is from being random, and to correct for guessing without assuming
guessing is at random. More recently, Wilcox (1980b) described six other
measurement problems that the model can solve. One problem was to
empirically determine the minimum number of distractors needed on a
multiple-choice test item. Another can be described as follows: for a
randomly selected examinee, let e be the expected number of items on an
n-item test for which we correctly determine whether the examinee knows the
correct response. How many examinees do we need to sample so that there is
a reasonably high probability of correctly determining whether e has a
value above or below some known constant?

Two types of guessing were considered in Wilcox (1980a). The first,
or Type I guessing, refers to situations where we have a population of
examinees and a single item. For a randomly sampled examinee, the
probability of guessing is defined to be Pr(correct response | examinee
does not know). Type II guessing is defined in terms of a single examinee
and a domain of items. In particular, it is Pr(correct response |
examinee does not know) for a randomly selected item.

In Wilcox (1980a), it was assumed that an examinee either knows the
correct response and answers the item correctly, or the examinee can
eliminate at most t-2 distractors, where t is the number of distractors on
the item. According to an answer-until-correct scoring procedure,
examinees choose distractors until the correct response is identified.



Assume Type I guessing, let $\zeta$ be the proportion of examinees who know
the item, and let $\zeta_i$ $(i = 0, \ldots, t-2)$ be the proportion of
examinees who can eliminate i distractors. Following Horst (1933), we also
assume that examinees who do not know guess at random from among the
distractors they cannot eliminate. Thus, the probability that a randomly
chosen examinee chooses the correct alternative on the first attempt is

$$ p_1 = \zeta + \sum_{i=0}^{t-2} \zeta_i/(t-i). \tag{1.1} $$

The probability of giving the correct response on the ith attempt is

$$ p_i = \sum_{j=0}^{t-i} \zeta_j/(t-j) \qquad (i = 2, \ldots, t). \tag{1.2} $$

In order for the model to hold we must have

$$ p_1 \ge p_2 \ge \cdots \ge p_t. \tag{1.3} $$

Equation (1.3) can be empirically checked (Robertson, 1978). Moreover,
maximum likelihood estimates of the $p_i$'s are easily obtained when (1.3) is
assumed by applying the "pool-adjacent-violators" algorithm (e.g., Barlow
et al., 1972, pp. 13-18), which in turn yields maximum likelihood estimates
of the $\zeta$'s. In particular, if in a random sample of N examinees, $x_1$
examinees choose the correct alternative on the first try, and $x_2$ examinees
choose the correct alternative on the second try, then

$$ \hat{\zeta} = \begin{cases} (x_1 - x_2)/N, & x_1 > x_2 \\ 0, & x_1 \le x_2 \end{cases} \tag{1.4} $$

is a maximum likelihood estimate of $\zeta$. (For alternative methods of
scoring and analyzing answer-until-correct tests, see Dalrymple-Alford,
1970; Brown, 1965.)
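The estimation step described above can be sketched in Python as follows. This is a minimal equal-weight version of the pool-adjacent-violators idea, written on the assumption that equal weights are appropriate for multinomial cell proportions; the function names are hypothetical. The example counts are the observed frequencies 20, 49, 33, 30 and 18 quoted in section 2 for an item on which the null hypothesis was rejected.

```python
def pava_nonincreasing(values):
    """Equal-weight least-squares fit of a non-increasing sequence."""
    blocks = []  # each block is [mean, size]
    for v in values:
        blocks.append([float(v), 1])
        # Pool adjacent blocks while a later mean exceeds an earlier one.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            mean_b, size_b = blocks.pop()
            mean_a, size_a = blocks.pop()
            size = size_a + size_b
            blocks.append([(mean_a * size_a + mean_b * size_b) / size, size])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted

def zeta_hat(x1, x2, n_examinees):
    """Estimate (1.4): (x1 - x2)/N when x1 > x2, and 0 otherwise."""
    return max(x1 - x2, 0) / n_examinees

counts = [20, 49, 33, 30, 18]          # attempts 1 through 5
total = sum(counts)
p_hat = [c / total for c in counts]    # unrestricted sample proportions
p_fit = pava_nonincreasing(p_hat)      # estimates restricted by (1.3)
print(p_fit, zeta_hat(counts[0], counts[1], total))
```

Here the first two cells violate (1.3), so they are pooled, and the restricted estimates of $p_1$ and $p_2$ coincide; this agrees with (1.4), which gives $\hat{\zeta} = 0$ when $x_1 \le x_2$.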






There are two main goals to this paper. The first is to empirically
check the assumption in equation (1.3) for a reasonably large number of
items, and the second is to propose and to empirically investigate some
strong true-score models based on an answer-until-correct scoring procedure.
We note that the importance of strong true-score models has long been
established (e.g., Keats and Lord, 1962; Lord, 1965, 1969; Lord and Novick,
1968), and more recently they have played an important role in the realm
of criterion-referenced testing (e.g., Huynh, 1976, 1980; Wilcox, 1977).

2. EMPIRICAL TESTS OF EQUATION (1.3)

As noted above, our first goal is to empirically determine whether
equation (1.3) is reasonable when an answer-until-correct scoring procedure
is used. To do this, we used test results on 620 students enrolled in
an undergraduate psychology course. Each student took three tests during
the semester. The first two tests had 37 and 40 items respectively, and
the final examination had 40 items. All three tests had four forms and
all items had t=5 distractors. Each form consisted of the same items,
but they were presented in a different order. Using results in Robertson
(1978), a test of equation (1.3) was made for all 117 items. Each item
was tested four times according to which test form it was on. Thus, a
total of 468 tests were made.

At the .01 level of significance, the null hypothesis was rejected
about 5.8 percent of the time. For a little over half of the tests it was
unnecessary to apply Robertson's procedure because the sample estimates
of the $p_i$'s already satisfied the inequality. When Robertson's test was
applied, the results were usually highly nonsignificant. The observed
test scores indicate that there were two items during each testing period
(six items in all) which did not satisfy equation (1.3).

Table 1 shows the observed scores on one of the items on the final
examination that appears not to satisfy (1.3). On Form 2, for example,
35 examinees chose the correct response on their third attempt of the
item. For all four forms, Robertson's test was highly significant. The
striking feature of this item is the large number of examinees who chose
the correct response on their last attempt. One possible explanation is
that examinees had misinformation relevant to the question being asked,
and so they eliminated the correct response from consideration.
Unfortunately, there was no way to verify this.

Several of the items for which the null hypothesis was rejected had
a response pattern similar to the one shown in Table 1. That is, the
correct response was usually chosen last. In another instance where the
null hypothesis was rejected, the observed frequencies corresponding to
the number of attempts were 20, 49, 33, 30, and 18, respectively.

3. STRONG TRUE-SCORE MODELS

Next we consider the problem of finding a strong true-score model that
can be used in conjunction with an answer-until-correct scoring procedure.
In contrast to the previous section, only Type II guessing is considered.
We begin by considering a single examinee responding to items that
represent a particular item pool. Let $\tau$ be the proportion of items the
examinee knows, and let $\tau_i$ $(i = 0, \ldots, t-2)$ be the proportion of items
for which the examinee can eliminate i distractors when the correct response
is not known. Finally let $\theta_j$ $(j = 1, \ldots, t)$ be the probability of
choosing the correct response on the jth attempt of a randomly selected
item. The situation is essentially the same as in the previous section, but
the roles of items and examinees are interchanged.

For future reference, we note that

$$ \theta_1 = \tau + \sum_{i=0}^{t-2} \tau_i/(t-i) \tag{3.1} $$

and

$$ \theta_i = \sum_{j=0}^{t-i} \tau_j/(t-j) \qquad (i = 2, \ldots, t). \tag{3.2} $$

Let $y_1$ and $y_2$ be the number of items on an n-item test for which the
examinee chooses the correct response on the first and second attempt,
respectively. In Wilcox (1980a) it was assumed that the joint conditional
probability function of $y_1$ and $y_2$ is given by

$$ f(y_1, y_2 \mid \theta_1, \theta_2) = \frac{n!}{y_1!\, y_2!\, (n-y_1-y_2)!}\, \theta_1^{y_1} \theta_2^{y_2} (1-\theta_1-\theta_2)^{n-y_1-y_2}. \tag{3.3} $$

This implies that $f(y_1 \mid \theta_1)$ is a binomial probability function which, in
mental test theory, has certain theoretical disadvantages. In practice,
however, this assumption frequently gives good results. A recent discussion
of the issues can be found in Wilcox (1981).

Note that for the model to hold, we must have

$$ \theta_1 \ge \theta_2 \ge \cdots \ge \theta_t, $$

and so a maximum likelihood estimate of $\tau$ is

$$ \hat{\tau} = \begin{cases} (y_1 - y_2)/n, & y_1 > y_2 \\ 0, & y_1 \le y_2. \end{cases} \tag{3.5} $$



The goal in this section is to consider how we might extend (3.3)
to a population of examinees. Wilcox (1980a) suggests that for a
population of examinees, we assume the joint density of $\theta_1$ and $\theta_2$, or
the joint density of $\tau$ and $\theta_2$, belongs to the Dirichlet family. For the
former case, the joint density is given by

$$ g(\theta_1, \theta_2) = \frac{\Gamma(\nu_1+\nu_2+\nu_3)}{\Gamma(\nu_1)\Gamma(\nu_2)\Gamma(\nu_3)}\, \theta_1^{\nu_1-1} \theta_2^{\nu_2-1} (1-\theta_1-\theta_2)^{\nu_3-1} \tag{3.4} $$

where $\nu_1, \nu_2, \nu_3 > 0$ are unknown parameters. In the latter case we simply
replace $\theta_1$ with $\tau$ in equation (3.4). The motivation for (3.4) is that it
is the bivariate analog of the beta density which has proven to be useful
in many situations in mental test theory (Wilcox, 1981).

Empirical Results on the Dirichlet-multinomial Model

For the reasons given above, we began by assuming (3.4), and we then
tried to fit the Dirichlet-multinomial model to the final examination
test scores previously described. From results in section 2, two of the
forty items appear not to satisfy the assumptions made under our
answer-until-correct scoring procedure and Type I guessing, and so they were
eliminated. The observed marginal distribution of $y_1$ and $y_2$ for the
remaining 38 items is shown in Table 2.

It is known that when (3.4) is assumed, the marginal density of $\theta_1$ is
beta with parameters $\nu_1$ and $\nu_2+\nu_3$. We tried fitting the observed $y_1$'s
with a beta-binomial probability function (e.g., Keats and Lord, 1962). The
estimates of $\nu_1$ and $\nu_2+\nu_3$ were 8.645 and 8.2, respectively. The expected
frequencies under the model are shown in Table 2. A visual inspection of
Table 2 suggests that the beta-binomial model gives a reasonable fit to the
data, and a chi-square goodness-of-fit test (Cochran, 1954) confirms this.
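The expected frequencies under a beta-binomial model can be computed as in the following Python sketch. It uses the values reported in the text (n = 38 items, 620 examinees, parameter estimates 8.645 and 8.2); the function names are hypothetical, and the sketch is not the program used to produce Table 2.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(y, n, a, b):
    """P(Y = y) when Y | theta ~ binomial(n, theta) and theta ~ beta(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return exp(log_choose + log_beta(y + a, n - y + b) - log_beta(a, b))

n_items, n_examinees = 38, 620
a_hat, b_hat = 8.645, 8.2   # reported estimates of nu_1 and nu_2 + nu_3

pmf = [beta_binomial_pmf(y, n_items, a_hat, b_hat) for y in range(n_items + 1)]
expected = [n_examinees * p for p in pmf]
for y in (10, 20, 30):
    print(y, round(expected[y], 2))
```

Working on the log scale with `lgamma` avoids overflow in the gamma functions; the expected frequencies are simply the number of examinees times the probability of each score.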

Next we consider the observed frequencies corresponding to $y_2$. If
(3.4) is assumed, then the marginal probability function of $y_2$ is

$$ f(y_2) = \binom{n}{y_2} \frac{\Gamma(\nu_1+\nu_2+\nu_3)}{\Gamma(\nu_2)\Gamma(\nu_1+\nu_3)} \cdot \frac{\Gamma(y_2+\nu_2)\,\Gamma(n+\nu_1+\nu_3-y_2)}{\Gamma(n+\nu_1+\nu_2+\nu_3)}, \tag{3.6} $$

i.e., a beta-binomial density with parameters $\nu_2$ and $\nu_1+\nu_3$. The estimate
of $\nu_2$ was 25.6, and the estimate of $\nu_1+\nu_3$ was 101.6. Again a good fit to
the data was obtained. However, the estimates of $\nu_1$, $\nu_2$ and $\nu_2+\nu_3$ imply
that $\nu_3$ must be negative: with $\nu_2+\nu_3$ estimated as 8.2 and $\nu_2$ as 25.6,
the implied value of $\nu_3$ is 8.2 - 25.6 = -17.4. But the
Dirichlet-multinomial model assumes $\nu_i > 0$ (i=1,2,3). We tried instead to
estimate the $\nu_i$'s as described in Mosimann (1962). This yielded
$\hat{\nu}_1 = 6.08$, $\hat{\nu}_2 = 2.37$ and $\hat{\nu}_3 = 3.39$. We now have
admissible estimates of the $\nu_i$'s, but the fit to the data is no longer
satisfactory. Evidently, some other model must be used to explain the
observed scores.

Before describing a model that gives a reasonably good fit to the
data, we might mention two other models that were considered but which
gave unsatisfactory results. The first was a negative-multinomial model
(e.g., Sibuya et al., 1964), and the second was a compound negative
multinomial model, also known as the multivariate inverse Polya-Eggenberger
distribution (Mosimann, 1963; Sibuya, 1980; Sibuya and Shimizu, 1980;
Janardan and Patil, 1971).

A New Strong True-Score Model

Since a beta-binomial model gives a good fit to the observed marginal
distribution of $y_1$, we decided to assume (3.3) holds and that $\theta_1$ has a
beta density with parameters 8.645 and 8.2. The problem is to find a
reasonable relationship between $\theta_1$ and $\theta_2$ that accounts for the observed
marginal density of $y_2$. As noted above, the Dirichlet-multinomial model,
as well as two other models, is unsatisfactory for accomplishing this
goal.

Our common sense notion is that as $\tau$ increases, the probability of
guessing the correct response will also increase. For instance, an
examinee with a value for $\tau$ close to one might have more partial information
than an examinee for whom $\tau$ is small. That is, examinees with $\tau$ close
to one might be able to eliminate more distractors when they do not know,
as opposed to examinees for whom $\tau$ is close to zero. We note that Molenaar
(in press) has also argued for this point of view. Let us assume for the
moment that this is true, and consider how we might express this relationship.

After looking at the data, we decided to express the assumed
relationship between $\tau$ and guessing in terms of the conditional distribution
of $y_2$ given $y_1$. First note that for a specific examinee

$$ f(y_2 \mid y_1, \theta_1, \theta_2) = \frac{(n-y_1)!}{y_2!\,(n-y_1-y_2)!} \left(\frac{\theta_2}{1-\theta_1}\right)^{y_2} \left(1 - \frac{\theta_2}{1-\theta_1}\right)^{n-y_1-y_2}. \tag{3.7} $$

For notational convenience, let $\xi = \theta_2/(1-\theta_1)$. Our assumption about $\tau$
and guessing indicates that for the population of examinees, $\xi$ is an
increasing function of $\theta_1$. Since $0 < \xi < 1$, what we need is an increasing
function that maps the closed unit interval into a subset of itself. One way
to do this is to use a linear function of a cumulative distribution defined
on [0,1]. The beta distribution is the best known distribution with this
property, and so we decided to consider it for the problem at hand.
Accordingly, we assume



that for the population of examinees, $E(\xi \mid \theta_1)$ is given by

$$ \xi(\theta_1) = c\, \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} \int_0^{\theta_1} u^{r-1} (1-u)^{s-1}\, du + e \tag{3.8} $$

where c, e, r, and s are unknown positive constants to be determined, and
where 0 < c+e < 1.

Henceforth, we assume $\theta_2$ is completely determined by $\theta_1$ according to
equation (3.8). That is, for a specific examinee, $\theta_2 = (1-\theta_1)\,\xi(\theta_1)$. This
is, no doubt, an oversimplification of reality, but we want to avoid deriving
a model so mathematically complex that it cannot be applied. As it turns
out, equation (3.8) gives a reasonably good fit to the data.
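Equation (3.8) can be evaluated numerically as in the following Python sketch. The integral is approximated with Simpson's rule, which is an assumption of this sketch rather than the method used in the paper, and it presumes r and s are at least 1 so that the integrand is bounded. The default constants are the estimates reported for the final examination (c = e = .25, r = 1.776, s = 2.279); the function names are hypothetical.

```python
from math import lgamma, exp

def regularized_incomplete_beta(x, r, s, steps=2000):
    """Beta(r, s) cumulative distribution at x, by Simpson's rule.

    Assumes r >= 1 and s >= 1 so the integrand is bounded on [0, 1];
    steps must be even.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    norm = exp(lgamma(r + s) - lgamma(r) - lgamma(s))
    h = x / steps
    f = lambda u: u ** (r - 1) * (1 - u) ** (s - 1)
    total = f(0.0) + f(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(k * h)
    return norm * total * h / 3

def xi(theta1, c=0.25, e=0.25, r=1.776, s=2.279):
    """Equation (3.8): c times a beta distribution function, plus e."""
    return c * regularized_incomplete_beta(theta1, r, s) + e

def theta2(theta1, **kwargs):
    """theta_2 = (1 - theta_1) * xi(theta_1)."""
    return (1 - theta1) * xi(theta1, **kwargs)

for th in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(th, round(xi(th), 4), round(theta2(th), 4))
```

With these constants $\xi$ rises from .25, the value under random guessing with t = 5, to .50 as $\theta_1$ goes from 0 to 1.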

Next we determined c, e, r and s in the manner described in the
appendix. The results were c=.25, e=.25, r=1.776 and s=2.279.

As a partial check on the model, we decided to compare the expected
observed scores of $y_2$ to the values actually observed. To do this, we
need an expression for the marginal distribution of $y_2$ assuming equations
(3.8) and (3.3) hold, and that $\theta_1$ has a beta density with parameters 8.645
and 8.2. Writing $\xi(\theta_1)$ simply as $\xi$, and since $\theta_2 = (1-\theta_1)\xi$, equation
(3.3) can be written as

$$ f(y_1, y_2 \mid \theta_1) = \frac{n!\, \theta_1^{y_1} [\xi(1-\theta_1)]^{y_2} [1-\theta_1-\xi(1-\theta_1)]^{n-y_1-y_2}}{y_1!\, y_2!\, (n-y_1-y_2)!} \tag{3.9} $$

and

$$ f(y_2 \mid \theta_1) = \frac{n!\, [\xi(1-\theta_1)]^{y_2} [1-\xi(1-\theta_1)]^{n-y_2}}{y_2!\, (n-y_2)!}. \tag{3.10} $$

Substituting (3.8) into (3.10), multiplying by $g(\theta_1)$, and integrating out
$\theta_1$ yields the marginal density of $y_2$. Symbolically,

$$ f(y_2) = \int_0^1 f(y_2 \mid \theta_1)\, g(\theta_1)\, d\theta_1 \tag{3.11} $$



where, from previous results, we assume

$$ g(\theta_1) = \frac{\Gamma(16.845)}{\Gamma(8.645)\Gamma(8.2)}\, \theta_1^{7.645} (1-\theta_1)^{7.2} \tag{3.12} $$

and $f(y_2 \mid \theta_1)$ is given by (3.10). Since $\xi$ is a function of $\theta_1$, it is
difficult to find a closed form expression for (3.11). However, for
practical purposes, this is not a serious problem since the integration is
easily accomplished using numerical quadrature techniques. We used the
IBM (1971) subroutine DQG32. For those who do not have access to this
subroutine, the necessary formulas can be found in Stroud and Secrest
(1966). The expected scores of $y_2$ based on (3.11) and (3.12) are shown in
the last column of Table 2. The usual chi-square statistic was found to be
22.9. With 12 degrees of freedom the level of significance is between .025
and .05. (Note that e is assumed known, as is explained in the appendix,
and so (3.11) has three unknown parameters since (3.12) is assumed.)
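The integration in (3.11) can be sketched in Python as follows. This sketch substitutes a simple midpoint rule for the Gaussian quadrature subroutine used in the paper, and it accumulates the beta distribution function in (3.8) on the same grid; both choices are assumptions made for brevity. The constants are the estimates reported in the text, and the function names are hypothetical.

```python
from math import lgamma, exp, log

def beta_pdf(u, a, b):
    """Beta(a, b) density at 0 < u < 1."""
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)
               + (a - 1) * log(u) + (b - 1) * log(1 - u))

def binom_pmf(y, n, p):
    """Binomial(n, p) probability of y, for 0 < p < 1."""
    return exp(lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
               + y * log(p) + (n - y) * log(1 - p))

def marginal_y2(n, c, e, r, s, a, b, grid=2000):
    """Approximate f(y2) in (3.11) for y2 = 0, ..., n by the midpoint rule."""
    h = 1.0 / grid
    probs = [0.0] * (n + 1)
    cdf = 0.0  # running value of the beta(r, s) distribution function
    for k in range(grid):
        theta1 = (k + 0.5) * h
        dens = beta_pdf(theta1, r, s)
        cdf_mid = min(cdf + 0.5 * h * dens, 1.0)  # distribution function at theta1
        cdf += h * dens
        xi = c * cdf_mid + e                # equation (3.8)
        theta2 = (1 - theta1) * xi          # probability of success on try 2
        weight = beta_pdf(theta1, a, b) * h  # g(theta1) d(theta1), eq. (3.12)
        for y in range(n + 1):
            probs[y] += binom_pmf(y, n, theta2) * weight  # eq. (3.10)
    return probs

f_y2 = marginal_y2(n=38, c=0.25, e=0.25, r=1.776, s=2.279, a=8.645, b=8.2)
expected_y2 = [620 * p for p in f_y2]
print([round(v, 2) for v in expected_y2[:15]])
```

Multiplying each probability by the 620 examinees gives expected frequencies that can be set against the observed frequencies of $y_2$ in a chi-square comparison.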

We observe that the estimates of $\xi$ corresponding to $y_1$ = 4, 7 and 33
are based on a relatively small number of examinees. In fact, for $y_1$ = 4
there is only one examinee, and the same is true for $y_1$ = 33. Thus, it might
be that $\hat{\xi}$ is unusually spurious at these points, and this would explain
why we get estimates of $\xi$ that seem to be relatively inconsistent with
the notion that $\xi$ is a strictly increasing function of $\theta_1$. (See Table A2
in the appendix.)

It is interesting that if we ignore the estimates of $\xi$ at these
points, we get c=.33, r=.88 and s=.909 with e still equal to .25. In
this case the value of the chi-square statistic is 15.63 and the level
of significance is between .05 and .1. In either case, we get a
reasonable approximation to the data. Note, however, that if we assume random
guessing, i.e., $\xi = (t-1)^{-1} = .25$, as is frequently done, we get a very
poor fit to the data.

Next we applied the model to the observed scores on the second test
taken during the semester. We used the same 620 examinees. Again, two
of the forty items did not satisfy (1.3), and so they were eliminated.
The parameters of our strong true-score model were estimated and found to
be very similar to the estimated values based on the final examination.
Also, we again got a reasonable fit to the data.

Before concluding this section we note that the above results suggest
we estimate $\tau$ with $\hat{\tau} = \hat{\theta}_1 - \hat{\theta}_2 = \hat{\theta}_1 - (1-\hat{\theta}_1)\hat{\xi}$. If we
arbitrarily set $\hat{\xi} = (t-1)^{-1}$, we get the usual correction for guessing
formula score.



4. SOME APPLICATIONS TO MASTERY TESTS

In many instances it is a simple matter to extend existing applications
of the beta-binomial model to the model described in section 3. By
way of illustration, we consider two problems that occur with mastery
tests.

A frequent goal of a mastery or criterion-referenced test is to sort
examinees into one of two mutually exclusive groups. In many instances
these groups are defined according to whether an examinee's true score
$\tau$ is above or below some known constant, say $\tau_0$. In the context of an
answer-until-correct scoring procedure, we decide that $\tau \ge \tau_0$ if for the
examinee being tested, $(y_1 - y_2)/n \ge \tau_0$; otherwise the decision $\tau < \tau_0$ is
made.

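For a randomly selected examinee, the probability of making a correct
decision about whether $\tau$ is above or below $\tau_0$ is given by

$$ \Pr(y_1 - y_2 \ge \lceil n\tau_0 \rceil,\ \tau \ge \tau_0) + \Pr(y_1 - y_2 < \lceil n\tau_0 \rceil,\ \tau < \tau_0) \tag{4.1} $$

where $\lceil n\tau_0 \rceil$ is the smallest integer greater than or equal to $n\tau_0$. But
(4.1) is equal to

$$ \sum \int_{\theta_0}^{1} f(y_1, y_2 \mid \theta_1)\, g(\theta_1)\, d\theta_1 + \sum \int_{0}^{\theta_0} f(y_1, y_2 \mid \theta_1)\, g(\theta_1)\, d\theta_1 \tag{4.2} $$

where the first summation is over all $(y_1, y_2)$ such that $y_1 - y_2 \ge \lceil n\tau_0 \rceil$,
the second summation is over all $(y_1, y_2)$ such that $y_1 - y_2 < \lceil n\tau_0 \rceil$, and
$\theta_0$ is the value of $\theta_1$ such that $\theta_1 - (1-\theta_1)\xi = \tau_0$. Thus, the probability
of a correct decision can be determined once (3.8) is estimated.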

Another approach to characterizing mastery tests is the single
administration estimate of the proportion of agreement. We are
given the observed scores of N examinees, and we want to estimate the
probability that a randomly selected examinee would be classified in the
same manner if he/she took two randomly parallel tests.

Let $z_1$ and $z_2$ be the observed scores corresponding to $y_1$ and $y_2$ for
an examinee who takes a randomly parallel test. Proceeding in a manner
similar to Huynh (1976), assume the density $f(z_1, z_2 \mid \theta_1)$ has the same form
as $f(y_1, y_2 \mid \theta_1)$, which is given by (3.9). Thus, after making the appropriate
independence assumption, the joint density of $y_1$, $y_2$, $z_1$ and $z_2$ is

$$ f(y_1, y_2, z_1, z_2) = \int_0^1 f(y_1, y_2 \mid \theta_1)\, f(z_1, z_2 \mid \theta_1)\, g(\theta_1)\, d\theta_1 $$

which can be evaluated with IBM subroutine DQG32. The proportion of
agreement is

$$ \sum f(y_1, y_2, z_1, z_2) $$

where the summation is over all points where both $y_1 - y_2$ and $z_1 - z_2$ are
greater than or equal to $n\tau_0$, or where both are less than $n\tau_0$.



5. DIRECTIONS FOR FUTURE RESEARCH

We briefly describe some of the problems that might occur when using
the strong true-score model proposed in section 3.

First, the assumption that the marginal probability function of $y_1$
belongs to the beta-binomial family has yielded good results for various
measurement problems when applied to real data (e.g., Gross and Shulman,
1980; Subkoviak, 1978; Keats and Lord, 1962; Lord, 1965). However, as might
be expected, this is not always the case. Keats (1964a) reports a data
set for which the beta-binomial model gives a poor fit, and Keats (1964b)
reports several other data sets for which the model gives unsatisfactory
results. Accordingly, we briefly outline solutions that might be
considered when the beta-binomial model is unsatisfactory. The details are
left for future investigations.

First we note that when trying to find a probability function that
gives a good fit to data, three of the best known and most frequently
employed distributions are the binomial, Poisson and negative binomial.
Of course, the Poisson distribution usually gives good results when applied
to situations where a particular event occurs infrequently. Also, the
negative binomial distribution is often the first choice when it is
believed that the Poisson distribution might be inadequate (Johnson and Kotz,
1969, p. 125).

Suppose we replace (3.3) with the assumption that for a particular
examinee, the probability of $z = n - y_1$ is

$$ f(z \mid \gamma) = e^{-\gamma} \gamma^{z} / z! \qquad (z = 0, 1, \ldots) \tag{5.1} $$

i.e., a Poisson density with parameter $\gamma$. If we also assume $\gamma$ has a gamma
distribution for the population of examinees, the marginal distribution of
z is negative binomial given by

where $\alpha$ and $\beta$ are unknown parameters. As noted in Wilcox (1981), this
distribution gives a reasonable fit to the data reported by Keats (1964a)
while the beta-binomial model does not. We also note that Johnson and
Kotz (1969) list several techniques for estimating the parameters in (5.2).

One problem is how to represent the joint distribution of $y_2$ and z.
A mathematically convenient approach is to assume $y_2$ is also Poisson and
that z and $y_2$ are (conditionally) independent. To allow z and $y_2$ to be
correlated, we might use the bivariate Poisson distribution derived by
Holgate (1964).

In principle, at least, a gamma-Poisson model could be applied, and
an estimate of $\xi$ could be derived. However, if test scores are highly
skewed, as they are for the data in Keats (1964a), we might get poor
estimates of $\xi$ for examinees with low ability because there are so few
examinees with low ability. Hopefully the seriousness of this problem will
be investigated sometime in the future.

Rather than assume $\gamma$ has a gamma distribution, we might assume it
has a gamma product ratio distribution, in which case the marginal
probability function of z is

$$ f(z) = \frac{\Gamma(\alpha+\omega)\,\Gamma(\beta+\omega)}{\Gamma(\omega)\,\Gamma(\alpha+\beta+\omega)} \cdot \frac{(\alpha)_z (\beta)_z}{(\alpha+\beta+\omega)_z\, z!} \tag{5.3} $$

where $\alpha, \beta, \omega > 0$ and

$$ (a)_z = \begin{cases} 1, & z = 0 \\ a(a+1)\cdots(a+z-1), & z = 1, 2, \ldots \end{cases} $$

(Sibuya, 1979). The distribution (5.3) is known as the inverse Polya-
Eggenberger, the generalized Waring, and the negative binomial beta.
The last term is sometimes used because if $f(z \mid p)$ is negative binomial
with parameters $\alpha$ and p, and if p is beta with parameters $\omega$ and $\beta$, the
marginal distribution of z is (5.3).

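The rth factorial moment of (5.3) is

$$ \mu_{(r)} = E(z^{(r)}) = (\alpha)_r (\beta)_r / (\omega-1)^{(r)} $$

where $a^{(r)} = a(a-1)\cdots(a-r+1)$.

We can estimate $\omega$ by the method of moments by noting that

$$ \omega\left(\frac{\mu_{(2)}}{\mu_{(1)}} - \mu_{(1)}\right) = 2\,\frac{\mu_{(2)}}{\mu_{(1)}} - \mu_{(1)} + \alpha + \beta + 1 $$

and

$$ \omega\left(\frac{\mu_{(3)}}{\mu_{(2)}} - \mu_{(1)}\right) = 3\,\frac{\mu_{(3)}}{\mu_{(2)}} - \mu_{(1)} + 2\alpha + 2\beta + 4. $$

The values of $\alpha$ and $\beta$ can then be determined via the estimates of $\mu_{(1)}$
and $\mu_{(2)}$. We note that Irwin (1968) reports some real data for which (5.3)
improves upon the fit obtained with the negative binomial, but the
improvement is not overly striking.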
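The method-of-moments step can be sketched in Python as follows. Eliminating $\alpha + \beta$ from the two relations that follow from the factorial moment formula gives $\omega$ directly; $\alpha + \beta$ and $\alpha\beta$ then follow from the first two factorial moments. The closed-form expressions in the code are this sketch's own algebra, not formulas quoted from the paper, and the function name and the parameter values in the check are hypothetical.

```python
from math import sqrt

def waring_moment_estimates(m1, m2, m3):
    """Solve for (omega, alpha, beta) from the first three factorial moments.

    Uses mu_(r) = (alpha)_r (beta)_r / (omega - 1)^(r), which gives
      (omega - 2) m2/m1 = (omega - 1) m1 + alpha + beta + 1
      (omega - 3) m3/m2 = (omega - 1) m1 + 2 alpha + 2 beta + 4.
    """
    r1 = m2 / m1
    r2 = m3 / m2
    # Twice the first relation minus the second removes alpha + beta.
    omega = (4 * r1 - 3 * r2 - m1 - 2) / (2 * r1 - r2 - m1)
    total = omega * (r1 - m1) - 2 * r1 + m1 - 1   # alpha + beta
    product = (omega - 1) * m1                    # alpha * beta
    disc = total * total - 4 * product
    if disc < 0:
        raise ValueError("moments are not consistent with real alpha, beta")
    root = sqrt(disc)
    return omega, (total - root) / 2, (total + root) / 2

# Check with exact moments for hypothetical alpha = 2, beta = 3, omega = 10.
m1 = (2 * 3) / 9
m2 = (2 * 3 * 3 * 4) / (9 * 8)
m3 = (2 * 3 * 4 * 3 * 4 * 5) / (9 * 8 * 7)
omega_hat, alpha_hat, beta_hat = waring_moment_estimates(m1, m2, m3)
print(omega_hat, alpha_hat, beta_hat)
```

In practice the population factorial moments would be replaced by their sample counterparts, and third-moment estimates can be unstable in small samples, so the result should be treated as a starting value.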



As for the joint distribution of z and $y_2$, we might use the multivariate
analog of (5.3). (See, for example, Sibuya, 1980.) Once $\alpha$ is estimated,
results in Mosimann (1963) can be used to estimate the remaining parameters.

6. CONCLUDING REMARKS

There are two main points to this paper. First, Wilcox (1980a) made
certain assumptions about how examinees behave when responding to test
items according to an answer-until-correct scoring procedure. These
assumptions imply that the cell probabilities in a multinomial distribution
must satisfy a particular set of inequalities. The data used in this study
suggest that these inequalities will frequently hold.

The second point is that a strong true-score model was proposed that
allows the probability of guessing the correct response to vary over a
population of examinees. In particular, it was assumed that the probability
of guessing correctly is a strictly increasing function of an examinee's
ability level. Furthermore, the model gives a reasonably good fit to our
data, and it allows us to correct for guessing without assuming guessing
is at random.

Finally, we have outlined some of the potential difficulties with our
proposed model. Hopefully these issues will be resolved sometime in the
future.



APPENDIX

The problem is to derive an estimate of the parameters in equation
(3.8). To motivate our solution, we first rederive the estimate of the
true score distribution used by Lord and Novick (1968, chapter 23). The
point is that the derivation is done in a slightly different fashion than
is customary. We then apply this same technique to obtain an estimate
of $\xi$ as a function of $\theta_1$.

Suppose that on an n-item test, observed scores for a specific
examinee have a probability function given by

$$ f(x \mid \pi) = \binom{n}{x} \pi^{x} (1-\pi)^{n-x}. $$

For a population of examinees, let $g(\pi)$ be the density of $\pi$, and suppose
we want to estimate the first two moments of $\pi$. One approach to this
problem is as follows: Let $x_i$ (i=1,...,N) be the observed scores of N
randomly sampled examinees, and let $f_x$ be the number of examinees with an
observed score of x where, of course, $\sum f_x = N$. Temporarily assume that
every examinee's true score $\pi$ has one of n+1 possible values, namely,
$\pi_i = i/n$ (i=0,...,n). The observed values of x suggest that we have sampled
$f_x$ examinees with true score x/n. Thus, an estimate of the probability of
choosing an examinee having true score $\pi_x$ is $h(\pi_x) = f_x/N$. Since
$\hat{\pi}_i = x_i/n$ is an unbiased estimate of $\pi$ for the ith examinee, an
estimate of $E(\pi)$ is

$$ \sum_{i=0}^{n} \pi_i\, h(\pi_i) = \sum_{x=0}^{n} \frac{x}{n} \cdot \frac{f_x}{N} = \frac{1}{N} \sum_{i=1}^{N} \frac{x_i}{n}. $$

Since $n^{-1}(n-1)^{-1} x(x-1)$ is an unbiased estimate of $\pi^2$, this suggests, for
similar reasons, that we estimate $E(\pi^2)$ with

$$ N^{-1} \sum_{i=1}^{N} \frac{x_i (x_i - 1)}{n(n-1)}. $$

These estimates of $E(\pi)$ and $E(\pi^2)$ are the same as the ones derived in
Lord and Novick (1968). If we now assume $g(\pi)$ belongs to the beta family,
we have their estimate of the true score distribution.

In this paper we assume that $\theta_1$ completely determines $\xi$, and that
$\xi$ is given by (3.8). Temporarily assume that $\theta_1$ is discrete, and that its
possible values are i/n (i=0,...,n). Suppose we want to estimate the
value of $\xi$ for the possible values of $\theta_1$. We do this as
follows.



For notational convenience let $y = y_1$, and suppose $f_y$ examinees get y
items correct on their first attempt of an item. Thus, we would estimate
that $f_y$ examinees have $\theta_1 = y/n$. Let $h(y_2 \mid y)$ be the number of examinees
who get $y_2$ correct on the second try of an item given that there were y
items for which the examinee chose the correct response on the first try.
Finally, let

$$ \hat{\xi} = \frac{\sum_{y_2} y_2\, h(y_2 \mid y)}{(n-y)\, h} $$

where $h = \sum_{y_2} h(y_2 \mid y)$. Then $\hat{\xi}$ is an estimate of $\xi$ when $\theta_1 = y/n$.

We illustrate the calculations using a specific case from the data
reported in the paper. Consider y=11. The corresponding $y_2$ values for
which $h(y_2 \mid y)$ is positive are $y_2$ = 8, 9, 10, 11, 13 and 14. The frequencies
(the values of $h(y_2 \mid y)$) were 4, 5, 1, 1, 1, 2, respectively. Thus, h=14.
Since there are n=38 items, we would estimate $\xi(11/38)$ to be
139/[(38-11)(14)] = 139/378, or approximately .37.
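The worked case for y = 11 can be reproduced with a few lines of Python. The function name `xi_hat` is hypothetical; the frequencies are the ones quoted in the paragraph above.

```python
def xi_hat(n_items, y1, second_try_counts):
    """Estimate xi at theta_1 = y1/n from examinees with first-try score y1.

    second_try_counts maps each second-try score y2 to the number of
    examinees h(y2 | y1) who obtained it.
    """
    h = sum(second_try_counts.values())
    numerator = sum(y2 * count for y2, count in second_try_counts.items())
    return numerator / ((n_items - y1) * h)

counts_y11 = {8: 4, 9: 5, 10: 1, 11: 1, 13: 1, 14: 2}
estimate = xi_hat(38, 11, counts_y11)
print(round(estimate, 4))
```

The estimate is the average second-try score of these examinees divided by the number of items they missed on the first try.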



Table Al shows the estimates of sCeJ'fos the final examination test 
scores used in the pape>. The values of £ suggest that £ is indeed an 
increasing function of 6p but occasionally £ decreases* According* we 



20 



applied the pool-adjacent-violators algorithm (Barlow, et al., 1972, 
f pp. 13-15.) to* estimate 5 under the assumption that* it is a nondecreasing 
function of 8^. The results are reported in Table Al as 

Since there are $t = 5$ alternatives for every item on the test, and since, for a specific examinee, $\zeta$ is the probability of a correct response on the second try when the examinee is incorrect on the first try, the values of $\hat{\zeta}$ suggest that examinees with low ability are guessing approximately at random. We decided in advance to set $e = (t-1)^{-1} = .25$, and the data suggest that this is reasonable. Based on Table A1, we also assume that the upper value of $\zeta$ is .50, and so we set $c = .50 - e = .25$.

There remains the problem of estimating $r$ and $s$. First, since $\zeta$ is assumed to be a strictly increasing function of $\theta_1$, we cannot use the same estimate of $\zeta$ for two distinct values of $\theta_1$. Suppose $\theta_{1i}$ $(i = 1, \ldots, m)$ are $m$ points where the estimate of $\zeta$ is the same. For the purpose of estimating $r$ and $s$, we replace the points $\theta_{1i}$ $(i = 1, \ldots, m)$ with their average, $m^{-1} \sum_i \theta_{1i}$. For example, in Table A1 we have that the estimate of $\zeta$ is .305 at $\theta_1 = .24$ and .26. Thus, instead of using the two points $\theta_1 = .24$ and .26, we assume $\zeta = .305$ at $\theta_1 = .25$, and that a value of $\zeta$ at $\theta_1 = .24$ and .26 is not available. The resulting values of $\theta_1$ and the corresponding values of $\zeta$ are shown in Table A2.

Next set $\eta = (\zeta - .25)/.25$ and note that

$$\eta = \int_0^{\theta_1} \frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)} \, u^{r-1} (1-u)^{s-1} \, du .$$

The values of $\eta$ corresponding to the $\zeta$ values are summarized in Table A2. They give us a step function approximation to an assumed cumulative beta distribution. Thus, by calculating the mean and variance of this step function, we can estimate $r$ and $s$ (e.g., Lord and Novick, 1968, chapter 23; Wilcox, 1977). For the data used here, the estimates were $\hat{r} = 1.776$ and $\hat{s} = 2.279$, respectively.
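One way to read the moment calculation is sketched below. It treats the step function as a discrete distribution with a jump at each tabled $\theta_1$ value and then matches the mean and variance of a beta distribution. The function name, the treatment of the jumps, and the $\theta_1$ and $\eta$ values as read from Table A2 are assumptions. The result is close to, but not identical with, the reported estimates of 1.776 and 2.279, so the authors' exact computation presumably differed in some detail.

```python
def beta_moments_from_step_cdf(points, cdf):
    """Method-of-moments beta parameters (r, s) from a step-function cdf.

    points : increasing values of theta_1 where the step function jumps
    cdf    : value of the step function (eta) at each point; last value is 1
    """
    previous = 0.0
    mean = 0.0
    second = 0.0
    for x, height in zip(points, cdf):
        mass = height - previous             # size of the jump at x
        mean += x * mass
        second += x * x * mass
        previous = height
    variance = second - mean ** 2
    # For a beta(r, s): mean = r/(r+s), variance = mean(1-mean)/(r+s+1).
    total = mean * (1.0 - mean) / variance - 1.0
    return mean * total, (1.0 - mean) * total


theta = [.17, .25, .30, .34, .38, .45, .53, .62, .70, .80]   # as read from Table A2
eta = [.19, .22, .42, .48, .53, .64, .72, .76, .84, 1.00]
r_hat, s_hat = beta_moments_from_step_cdf(theta, eta)
print(round(r_hat, 2), round(s_hat, 2))
```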

TABLE 1

Observed Frequencies for an Item Not Satisfying (1.3)

                     Number of Attempts
Test Form       1      2      3      4      5

    1          19     21     24     36     57
    2          16     22     35     30     51
    3          13      ?     33     24     67
    4          13      3     42     34     52

? = entry illegible in the source.



TABLE 2

Observed and Expected Scores on the Final Examination

Columns:
  (1) Score
  (2) Observed frequency of y1
  (3) Observed frequency of y2
  (4) Expected y1 when y1 is bebi(8.645, 8.2)
  (5) Expected y2 when y2 is bebi(25.6, 101.61)
  (6) Expected y2 when c=e=.25, r=1.776, s=2.279

 (1)    (2)    (3)      (4)      (5)      (6)

   0      ?      ?      .00      .37      .62
   1      ?      ?      .02     2.67     3.66
   2      ?      ?        ?     9.55    11.72
   3      ?     24      .20    23.19    26.72
   4      ?      ?        ?    42.78    47.55
   5      ?      ?     1.00    63.98    69.63
   6      0      ?        ?    80.60    86.30
   7      ?      ?     3.20    87.85    92.26
   8      4      ?        ?    84.26    86.12
   9      ?      ?        ?    72.30    70.93
  10      ?      ?        ?    56.00    51.89
  11      ?      ?    14.44    39.43    33.91
  12      ?     20        ?    25.48    19.84
  13      ?      7        ?    15.13    10.48
  14      ?      7    27.87     8.31     4.96
  15      ?      ?    32.37     4.22     2.17
  16      ?      ?    36.43     1.98      .87
  17      ?      ?    39.79      .87      .31
  18      ?      ?    42.22      .37      .12
  19      ?      ?        ?      .12     0.00
  20      ?      ?    43.03      .06     0.00
  21      ?      ?        ?     0.00     0.00
  22      ?      ?    40.25     0.00     0.00
  23      ?      ?        ?     0.00     0.00
  24      ?      ?        ?     0.00     0.00
  25      ?      ?    28.41     0.00     0.00
  26      ?      ?        ?     0.00     0.00
  27      ?      ?        ?     0.00     0.00
  28     13      0    14.59     0.00     0.00
  29     11      0    10.72     0.00     0.00
  30      6      0     7.48     0.00     0.00
  31      4      0     4.90     0.00     0.00
  32      6      0     3.00     0.00     0.00
  33      1      0     1.68     0.00     0.00
  34      2      0      .84     0.00     0.00
  35      1      0      .37     0.00     0.00
  36      0      0      .13     0.00     0.00
  37      0      0      .03     0.00     0.00
  38      0      0      .00     0.00     0.00

? = entry illegible in the source.

Explanation of notation: y is bebi(a, b) means y has a beta-binomial density with parameters a and b.




TABLE A1

Columns: first-try score $y_1$; $\theta_1 = y_1/38$; estimate of $\zeta$; estimate of $\zeta$ after applying the pool-adjacent-violators algorithm.

  y1    theta1    Estimate    Pooled estimate

   4     .11        .23          .298
   7     .18        .41          .298
   8     .21        .26          .298
   9     .24        .32          .305
  10     .26        .29          .305
  11     .29        .36          .355
  12     .32        .35          .355
  13     .34        .37          .37
  14     .37        .39          .385
  15     .39        .38          .385
  16     .42        .42          .41
  17     .45        .40          .41
  18     .47        .40          .41
  19     .50        .45          .43
  20     .53        .42          .43
  21     .55        .43          .43
  22     .58        .48          .44
  23     .61        .42          .44
  24     .63        .43          .44
  25     .66        .43          .44
  26     .68        .50          .46
  27     .71        .42          .46
  28     .74        .55          .50
  29     .76        .45          .50
  30     .79        .63          .50
  31     .82        .57          .50
  32     .87        .20          .50


TABLE A2

  theta1:   .17    .25    .30    .34    .38    .45    .53    .62    .70    .80
  zeta:     .298   .305   .355   .37    .385   .41    .43    .44    .46    .5
  eta:      .19    .22    .42    .48    .53    .64    .72    .76    .84   1.00



References

Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972) Statistical inference under order restrictions. New York: Wiley.

Brown, J. (1965) Multiple response evaluation of discrimination. The British Journal of Mathematical and Statistical Psychology, 18, 125-137.

Cochran, W. G. (1954) Some methods for strengthening the common $\chi^2$ tests. Biometrics, 10, 417-451.

Dalrymple-Alford, E. C. (1970) A model for assessing multiple-choice test performance. British Journal of Mathematical and Statistical Psychology, 23, 199-203.

Gross, A. L., & Shulman, V. (1980) The applicability of the beta-binomial model for criterion-referenced testing. Journal of Educational Measurement, 17, 175-202.

Holgate, P. (1964) Estimation for the bivariate Poisson distribution. Biometrika, 51, 241-244.

Horst, P. (1933) The difficulty of a multiple choice test item. Journal of Educational Psychology, 24, 229-232.

Huynh, H. (1976) On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264.

Huynh, H. (1980) Statistical inference for false positive and false negative error rates in mastery testing. Psychometrika, 45, 107-120.

Irwin, J. O. (1968) The generalized Waring distribution applied to accident theory. Journal of the Royal Statistical Society, Series A, 131, 205.

Janardan, K. G., & Patil, G. P. (1971) The multivariate inverse Polya distribution: A model of contagion for data with multiple counts in inverse sampling. Studi di Probabilita, Statistica e Ricerca Operativa in onore di Giuseppe Pompilj, Oderisi-Gubbio, 1-15.

Johnson, N., & Kotz, S. (1969) Discrete distributions. New York: Wiley.

Keats, J. A. (1964) Some generalizations of a theoretical distribution of mental test scores. Psychometrika, 29, 215-231.

Keats, J. A. (1964) Survey of test score data with respect to curvilinear relationships. Psychological Reports, 15, 871-874.

Keats, J. A., & Lord, F. M. (1962) A theoretical distribution for mental test scores. Psychometrika, 27, 59-72.

Lord, F. M. (1965) A strong true-score theory, with applications. Psychometrika, 30, 239-270.

Lord, F. M. (1969) Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem). Psychometrika, 34, 259-299.

Lord, F. M., & Novick, M. R. (1968) Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.

Molenaar, I. W. (in press) On Wilcox's latent structure model for guessing. British Journal of Mathematical and Statistical Psychology.

Mosimann, J. E. (1962) On the compound multinomial distribution, the multivariate $\beta$-distribution, and correlations among proportions. Biometrika, 49, 65-82.

Mosimann, J. E. (1963) On the compound negative multinomial distribution and correlations among inversely sampled pollen counts. Biometrika, 50, 47-54.

Robertson, T. (1978) Testing for and against an order restriction on multinomial parameters. Journal of the American Statistical Association, 73, 197-202.

Sibuya, M. (1979) Generalized hypergeometric, digamma and trigamma distributions. Annals of the Institute of Statistical Mathematics, 373-390.

Sibuya, M. (1980) Multivariate digamma distribution. Annals of the Institute of Statistical Mathematics, 32, Part A, 25-36.

Sibuya, M., & Shimizu, R. (1980) Classification of the generalized hypergeometric family of distributions. The Institute of Statistical Mathematics, Research memorandum No. 192.

Sibuya, M., Yoshimura, I., & Shimizu, R. (1964) Negative multinomial distribution. Annals of the Institute of Statistical Mathematics, 16, 409-426.

Stroud, A. H., & Secrest, D. (1966) Gaussian quadrature formulas. New Jersey: Prentice-Hall.

Subkoviak, M. J. (1978) Empirical investigation of procedures for estimating reliability for mastery tests. Journal of Educational Measurement, 15, 111-116.

Wilcox, R. R. (1977) Estimating the likelihood of false-positive and false-negative decisions in mastery testing: An empirical Bayes approach. Journal of Educational Statistics, 2, 289-307.

Wilcox, R. R. (1980) Solving measurement problems with an answer-until-correct scoring procedure. Applied Psychological Measurement, in press.

Wilcox, R. R. (in press) Using results on k out of n system reliability to study and characterize tests. Educational and Psychological Measurement.

Wilcox, R. R. (1981) A review of the beta-binomial model and its extensions. Journal of Educational Statistics, 6, 3-32.

Wilcox, R. R. (in press) The single administration estimate of the proportion of agreement of a proficiency test scored with a latent structure model. Educational and Psychological Measurement.



SOME NEW RESULTS ON AN
ANSWER-UNTIL-CORRECT SCORING PROCEDURE

Rand R. Wilcox

DEPARTMENT OF PSYCHOLOGY
University of Southern California
Los Angeles, California 90007

and the

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles 90024



ABSTRACT

Wilcox (1981a, 1982) proposed a method of scoring and analyzing achievement tests and achievement test items that might be used to solve various measurement problems, including correcting for guessing without assuming guessing is at random. The new procedure is based on certain assumptions about how examinees behave when taking an answer-until-correct test. Certain implications of these assumptions have been empirically checked, and the results suggest that Wilcox's model will frequently be reasonable. The purpose of this paper is to see whether similar results will be obtained when a different type of achievement test is used with a substantially different population of examinees. Included is a simplification of Wilcox's strong true-score model that gives a good fit to one of the data sets. The paper also notes that a knowledge or random guessing model is highly unsatisfactory when trying to explain the observed test scores. Finally, a new model for measuring misinformation is proposed and found to give good results with two of the items.

Under an answer-until-correct (AUC) scoring procedure, examinees choose alternatives on a multiple-choice test item until the correct response is identified. In the past this has been accomplished by having examinees erase a shield on an answer sheet which reveals whether the correct response was chosen. If an incorrect alternative was selected, another shield is erased, and this process continues until the examinee chooses the correct alternative.

Wilcox (1981a; 1982, in press a) proposed a method of scoring and analyzing AUC tests that solves various measurement problems. These include correcting for guessing without assuming guessing is at random, testing whether guessing is at random, measuring "how far away" guessing is from being at random, estimating the accuracy of know/don't know decisions when a conventional scoring procedure is used, and empirically determining the number of distractors needed on a multiple-choice test. Wilcox also derived a strong true-score model that allows the probability of guessing the correct response to vary over the population of examinees, and the model also allows true score and the probability of guessing to be correlated. The new model contains the beta-binomial model (Lord & Novick, 1968, chapter 23; Wilcox, 1981b) and the Morrison & Brockway (1979) model as special cases. The scoring procedure has been applied to criterion-referenced tests (Wilcox, in press b, in press c) and found to substantially reduce the problems noted by van den Brink and Koele (1980) and Wilcox (1980).

The purpose of this paper is to empirically investigate certain implications of the assumptions made by Wilcox, to suggest a new model for measuring misinformation, and to indicate a modification of Wilcox's strong true-score model that might be used in certain situations.

2. Methods and results

Consider a randomly sampled examinee responding to a specific test item under an AUC scoring procedure. Let $p_i$ be the probability that the correct response is chosen on the ith attempt of the item, and suppose that examinees who do not know the correct response can eliminate at most $t-2$ distractors from consideration via partial information. Once the examinee eliminates as many distractors as he/she can, a response is chosen at random from among those remaining. If the randomly sampled examinee knows the correct response, it is assumed that the correct alternative is chosen on the first attempt.

If $\zeta$ is the proportion of examinees who know the correct response, and if $\zeta_i$ is the proportion who can eliminate $i$ distractors, then the $p_i$'s can be written as linear combinations of the $\zeta$'s. For example, if there are $t=4$ alternatives,

$$p_1 = \zeta + \zeta_0/4 + \zeta_1/3 + \zeta_2/2$$
$$p_2 = \zeta_0/4 + \zeta_1/3 + \zeta_2/2$$
$$p_3 = \zeta_0/4 + \zeta_1/3$$

and

$$p_4 = \zeta_0/4 .$$

Thus, if $N$ examinees are tested and if $x_i$ examinees are correct on their ith attempt of an item, the estimate of $\zeta$ is simply $\hat{\zeta} = (x_1 - x_2)/N$. Moreover, the above results easily generalize to any $t$ (Wilcox, 1981a), and it can be seen that

$$p_1 \ge p_2 \ge \cdots \ge p_t . \tag{2.1}$$
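The linear relations above can be written for any number of alternatives. The sketch below is an illustration under the stated assumptions (examinees who eliminate $j$ distractors guess at random among the remaining $t-j$ alternatives); the function names and the numerical proportions in the example are hypothetical and are not taken from the paper. The counts 111, 73, 38 are those listed for item 18 in Table 1.

```python
def attempt_probabilities(zeta, partial, t):
    """Probability of being correct on attempt 1..t.

    zeta    : proportion of examinees who know the answer
    partial : partial[j] is the proportion who can eliminate j distractors,
              for j = 0, ..., t-2
    t       : number of alternatives
    """
    if len(partial) != t - 1:
        raise ValueError("partial must have t-1 entries")
    probabilities = []
    for attempt in range(1, t + 1):
        # After eliminating j distractors, t-j alternatives remain, so the
        # correct one is found on each of attempts 1..t-j with chance 1/(t-j).
        p = sum(partial[j] / (t - j) for j in range(t - 1) if attempt <= t - j)
        if attempt == 1:
            p += zeta                      # those who know are correct at once
        probabilities.append(p)
    return probabilities


def estimate_zeta(counts):
    """Estimate of zeta from counts x_1, ..., x_t: (x_1 - x_2) / N."""
    return (counts[0] - counts[1]) / sum(counts)


def satisfies_order(p):
    """Check the ordering p_1 >= p_2 >= ... >= p_t."""
    return all(a >= b for a, b in zip(p, p[1:]))


p = attempt_probabilities(zeta=0.4, partial=[0.2, 0.3, 0.1], t=4)
print([round(v, 3) for v in p], satisfies_order(p))
print(round(estimate_zeta([111, 73, 38]), 3))
```

The estimate works because $p_1 - p_2 = \zeta$ under the model, whatever the values of the $\zeta_i$.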



A test of mechanical abilities was administered to examinees in Great Britain who were approximately 14 years old. Each item required the examinee to apply some physical law in order to solve a problem. For example, one of the questions was stated as follows:

"Where can a jet plane not fly?"

The alternatives were (A) over deep water, (B) over high mountains, (C) over mountains on the moon, (D) very low, (E) 8 miles above the earth.

Results in Robertson (1978) were applied to each of the 30 items to test whether equation 2.1 might hold. The first 15 items had t=5 alternatives, and the remaining 15 had t=3. The $x_i$ values are shown in Table 1. There were 386 examinees, but some examinees omitted certain items. For 20 of the items, Robertson's test was not necessary since the estimated $p_i$ values were already consistent with equation 2.1. Among the remaining items, two were significant at the .01 level (items 7 and 30 in Table 1), one was significant at the .05 level (item 29), and the remaining items were not significant at the .25 level.

3. THE MODEL AS A DIAGNOSTIC TOOL

When measuring achievement, particularly within an instructional setting, it would be helpful to have some method of detecting misinformation, identifying the type of misinformation being used, and, when it exists, measuring how pervasive this misinformation is. Of course the teacher's judgment of how the students are behaving on a test is an integral part of diagnosing misinformation. The results reported here are intended to supplement or possibly help verify the teacher's view. Included is a modification of Wilcox's model which might be helpful in this endeavor.

As noted in the previous section, item 7 proved to be inconsistent with Wilcox's model, and the natural reaction is to try to determine why we got this result. The item was worded as follows:

A block of iron weighs 40 newtons at room temperature. When it is heated until it is red hot it gets bigger. How much will it weigh when red hot?

(A) 39 newtons, (B) 40 newtons, (C) 40.5 newtons, (D) 41 newtons, and (E) 42 newtons.

It seems reasonable that some examinees might believe that because the iron is bigger when red hot, it should weigh more. Thus, these examinees will eliminate A and B from consideration and choose from among the responses C, D, and E. If the proportion of examinees acting in this manner is reasonably large, we would expect a disproportionate number of examinees requiring 4 attempts to identify the correct response, and this is consistent with the frequencies in Table 1.

For the reasons just outlined, it seems that Wilcox's model is inappropriate for item 7, and that the following model be used in its place. Let $\zeta$ be the proportion of examinees who know the correct response, and suppose that examinees who know are always correct on their first attempt. Let $\zeta_1$ be the proportion who do not know and choose alternatives at random, and let $\zeta_2$ be the proportion of examinees who believe that the iron weighs more when heated because it is bigger. If these three categories are the only ones to which an examinee can belong, then

$$p_1 = \zeta + \zeta_1/5 \tag{3.2}$$
$$p_2 = \zeta_1/5 \tag{3.3}$$
$$p_3 = \zeta_1/5 \tag{3.4}$$
$$p_4 = \zeta_2 + \zeta_1/5 \tag{3.5}$$

and

$$p_5 = \zeta_1/5 . \tag{3.6}$$

Note that this model is similar to the misinformation model used by Duncan (1974).

An obvious implication is that $p_2 = p_3 = p_5$. The unbiased, unrestricted maximum likelihood estimates of the $p_i$'s are $\hat{p}_1 = .425$, $\hat{p}_2 = .106$, $\hat{p}_3 = .101$, $\hat{p}_4 = .244$, and $\hat{p}_5 = .124$.

Let $p$ be the common value of $p_2$, $p_3$ and $p_5$ under the assumption the model holds. Then the maximum likelihood estimate of $p$ is just $(\hat{p}_2 + \hat{p}_3 + \hat{p}_5)/3 = .110$ (Zehna, 1966). The maximum likelihood estimates of $p_1$ and $p_4$ are still .425 and .244, respectively. A chi-square goodness-of-fit test yielded $\chi^2 = 1.055$ with one degree of freedom, and this is not significant at the .25 level. Thus, the model is reasonably consistent with the observed scores on item 7, and the maximum likelihood estimates of $\zeta$, $\zeta_1$ and $\zeta_2$ are $\hat{\zeta} = .312$, $\hat{\zeta}_1 = .55$, and $\hat{\zeta}_2 = .134$, respectively.
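The estimates for the misinformation model can be traced with a short calculation. The sketch below is illustrative only. The item 7 counts (164, 41, 39, 94, 48) are an assumption: they are the integers that reproduce the reported proportions with 386 examinees. The chi-square statistic comes out at about 1.05, near the reported 1.055, and the estimates agree with those in the text to about two decimals; third-decimal differences presumably reflect rounding.

```python
counts = [164, 41, 39, 94, 48]     # assumed attempt counts for item 7
N = sum(counts)                    # 386 examinees

# Under the model p_2 = p_3 = p_5 = p, the restricted estimate pools the cells.
p_common = (counts[1] + counts[2] + counts[4]) / (3 * N)

zeta1 = 5 * p_common               # from p = zeta_1 / 5
zeta = counts[0] / N - p_common    # from p_1 = zeta + zeta_1 / 5
zeta2 = counts[3] / N - p_common   # from p_4 = zeta_2 + zeta_1 / 5

# Pearson goodness-of-fit: cells 1 and 4 are fitted exactly.
expected = [counts[0], N * p_common, N * p_common, counts[3], N * p_common]
chi_square = sum((o - e) ** 2 / e for o, e in zip(counts, expected))

print(round(zeta, 3), round(zeta1, 3), round(zeta2, 3), round(chi_square, 3))
```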

The misinformation model just described assumes that examinees who incorrectly eliminate response B will choose the correct response on their fourth attempt. However, a slightly more general model can be applied. In particular, let $\gamma$ be the probability that examinees with misinformation will choose the correct response on their fourth attempt once they learn that responses C, D, and E are incorrect. Then equations (3.5) and (3.6) become

$$p_4 = \gamma \zeta_2 + \zeta_1/5$$

and

$$p_5 = (1-\gamma)\zeta_2 + \zeta_1/5 .$$

Using equations (3.3) and (3.4) to estimate $\zeta_1$, we now have that $\hat{\zeta}_1 = 5(.106 + .101)/2 = .5175$. Substituting this result in the remaining equations yields $\hat{\zeta} = .3215$, $\hat{\zeta}_2 = .161$, and $\hat{\gamma} = .873$.
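The substitution described above amounts to solving four linear relations. A brief sketch, using the rounded proportions quoted in the text (variable names are illustrative):

```python
p_hat = [.425, .106, .101, .244, .124]    # estimated p_1, ..., p_5 for item 7

guess = (p_hat[1] + p_hat[2]) / 2         # estimate of zeta_1 / 5 from (3.3), (3.4)
zeta1 = 5 * guess
zeta = p_hat[0] - guess                   # from p_1 = zeta + zeta_1 / 5
zeta2 = p_hat[3] + p_hat[4] - 2 * guess   # adding the two modified equations
gamma = (p_hat[3] - guess) / zeta2        # from p_4 = gamma * zeta_2 + zeta_1 / 5

print(round(zeta1, 4), round(zeta, 4), round(zeta2, 3), round(gamma, 3))
```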


4. AN EMPIRICAL CHECK OF WILCOX'S STRONG TRUE SCORE MODEL

Wilcox (1982) proposed a strong true-score model for answer-until-correct tests that can be described as follows: Consider a specific examinee responding to $n$ items. Let $y_i$ $(i = 1, \ldots, t)$ be the number of items for which the examinee chooses the correct response on the ith attempt. Assume that the probability function of the $y_i$'s is multinomial, i.e.,

$$f(y_1, \ldots, y_t \mid \theta_1, \ldots, \theta_t) = n! \prod_{i=1}^{t} \theta_i^{y_i} / y_i!$$

where the $\theta_i$'s are unknown parameters, $\sum \theta_i = 1$, and $\sum y_i = n$. Wilcox assumes that for the population of examinees, the marginal distribution of $y_1$ is beta-binomial given by

$$f(y_1) = \binom{n}{y_1} \frac{B(r + y_1,\; n + s - y_1)}{B(r, s)} \tag{4.1}$$

where $r > 0$ and $s > 0$ are unknown parameters, and $B$ is the beta function. Note that this assumption has proven to be useful when addressing various measurement problems (Wilcox, 1981b).
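For readers who want to evaluate the beta-binomial probabilities in (4.1), a minimal sketch using only the standard library follows. The function names are illustrative. The parameter values are the estimates reported later in this section for the 14 retained five-alternative items. Multiplying each probability by the number of examinees gives expected frequencies.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, n, r, s):
    """Beta-binomial probability of y successes in n trials, as in (4.1)."""
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return exp(log_choose + log_beta(r + y, n + s - y) - log_beta(r, s))


n, r, s = 14, 6.565, 6.487
pmf = [beta_binomial_pmf(y, n, r, s) for y in range(n + 1)]
mean = sum(y * p for y, p in enumerate(pmf))
print(round(sum(pmf), 6), round(mean, 3))
```

Working on the log scale avoids overflow in the gamma functions for larger tests.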

Next let $\zeta = \theta_2/(1 - \theta_1)$. Wilcox assumes that examinees with high ability are more likely to guess the correct response when they do not know. This assumption was expressed in terms of $\zeta$ by assuming that for the population of examinees, it is an increasing function of $\theta_1$. In particular, $E(\zeta \mid \theta_1)$ is assumed to be given by

$$c \int_0^{\theta_1} \frac{\Gamma(v_1 + v_2)}{\Gamma(v_1)\,\Gamma(v_2)} \, \theta^{v_1 - 1} (1 - \theta)^{v_2 - 1} \, d\theta + (t-1)^{-1}$$

where $c$, $v_1$ and $v_2$ are unknown parameters satisfying $0 < c < 1 - (t-1)^{-1}$, $v_1 > 0$ and $v_2 > 0$. Since for a specific examinee

$$E(y_2 \mid y_1) = (n - y_1)\,\zeta ,$$

it follows that

$$E_\theta(y_2 \mid y_1) = (n - y_1)\, E(\zeta \mid y_1) \tag{4.2}$$

where $E_\theta$ means expectation over the population of examinees (i.e., over the joint distribution of $\theta_1$ and $\theta_2$). This last result leads to an estimate of $c$, $v_1$ and $v_2$, and the details are given by Wilcox (1982).

First we tried fitting Wilcox's model to the items having t=5 alternatives. As already pointed out, one of these items appears not to satisfy equation 2.1, and so it was eliminated. For the remaining 14 items, the parameters in equation 4.1 were estimated to be $\hat{r} = 6.565$ and $\hat{s} = 6.487$. The observed and expected frequencies are shown in columns two and three of Table 2. As can be seen, there is close agreement among the corresponding values, and a chi-square goodness-of-fit test is highly nonsignificant. Note that the items with t=3 alternatives could have been included, but they were analyzed separately in order to illustrate a simplification of the model that might be useful in certain situations.

Next, $c$, $v_1$ and $v_2$ were estimated to be $\hat{c} = .5$, $\hat{v}_1 = 1.2396$ and $\hat{v}_2 = .5692$. The model assumes that for every examinee $\theta_2 = (1 - \theta_1)\zeta$, where $\zeta$ is given by equation 4.2. This implies that the marginal distribution of $y_2$ is

$$f(y_2) = \int_0^1 f(y_2 \mid \theta_1)\, g(\theta_1)\, d\theta_1 , \tag{4.3}$$

where

$$f(y_2 \mid \theta_1) = \binom{n}{y_2} \left[(1 - \theta_1)\zeta\right]^{y_2} \left[1 - (1 - \theta_1)\zeta\right]^{n - y_2} , \tag{4.4}$$

and where, from previous results, $g(\theta_1)$ is assumed to be a beta distribution with parameters $r = 6.565$ and $s = 6.487$. Thus, a check of the model is obtained by determining whether the right-hand side of equation 4.3 gives a good approximation to the observed marginal distribution of $y_2$. Equation 4.3 was evaluated with IBM (1971) subroutine DQG32. The observed and expected values for $y_2$ are shown in Table 2. As can be seen, equation 4.3 gives a reasonably good approximation to the observed frequencies, and a chi-square test is not significant at the .05 level.

A Random Guessing Model

It is interesting to see what happens when a random guessing model is assumed to hold. The expected frequencies for $y_2$ were computed, and they are shown in Table 2. It is clear that a random guessing model gives totally unsatisfactory results, and a goodness-of-fit test is highly significant. This result is consistent with results in Wilcox (1982) as well as Bliss (1980) and Cross & Frary (1977).
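The marginal distribution in equation 4.3 can be approximated numerically. The sketch below is not the DQG32 routine used in the paper; it swaps in a simple midpoint rule, and the function names are illustrative. Here $\zeta$ is passed as a function of $\theta_1$, and the random guessing model corresponds to the constant $\zeta = 1/(t-1)$. Multiplying the resulting probabilities by the number of examinees gives expected frequencies.

```python
from math import comb, exp, lgamma, log


def beta_pdf(x, r, s):
    """Density of a beta(r, s) distribution at 0 < x < 1."""
    log_norm = lgamma(r + s) - lgamma(r) - lgamma(s)
    return exp(log_norm + (r - 1) * log(x) + (s - 1) * log(1 - x))


def marginal_y2(n, r, s, zeta, steps=4000):
    """Approximate f(y2) for y2 = 0..n by a midpoint rule over theta_1.

    zeta : function giving the chance of a correct second attempt
           for an examinee with first-attempt parameter theta_1
    """
    width = 1.0 / steps
    pmf = [0.0] * (n + 1)
    for k in range(steps):
        theta1 = (k + 0.5) * width
        weight = beta_pdf(theta1, r, s) * width
        theta2 = (1.0 - theta1) * zeta(theta1)   # chance of success on try two
        for y2 in range(n + 1):
            pmf[y2] += comb(n, y2) * theta2 ** y2 * (1 - theta2) ** (n - y2) * weight
    return pmf


t, n = 5, 14                                     # 14 retained items, 5 alternatives
pmf = marginal_y2(n, 6.565, 6.487, zeta=lambda theta1: 1.0 / (t - 1))
mean_y2 = sum(y * p for y, p in enumerate(pmf))
print(round(sum(pmf), 6), round(mean_y2, 3))
```

A quick consistency check is that the mean of $y_2$ under random guessing should equal $n\,s/[(r+s)(t-1)]$, which the sketch reproduces.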

Analysis of Items with t=3 Alternatives

The analysis of the items with t=3 alternatives reveals that in some instances, a simpler version of Wilcox's model might be used. The motivation for this simplification arose as follows: When estimating $c$, $v_1$ and $v_2$, the value of $\zeta$ is estimated at each of the observed $y_1$ values, and it is assumed that these values are strictly increasing. For the items having t=3 alternatives, the estimates of $\zeta$ corresponding to $y_1 = 2(1)15$ were .578, .577, .654, .615, .582, .564, .448, .57[illegible], .52, .636, .595, .552, and .57. There were no cases for $y_1 = 0$ or 1. If the estimation procedure used by Wilcox is applied to these values, the results indicate a slight increase in $\zeta$ with increasing values of $y_1$, but the increase would seem to be too small to be concerned about. This suggests that a simpler model be considered where the $\zeta$ values are replaced by their average, which is .547. Thus, for a specific examinee it is assumed that $\theta_2 = .547(1 - \theta_1)$. Next replace $(1 - \theta_1)\zeta$ with $.547(1 - \theta_1)$ in equation 4.4, and replace $f(y_2 \mid \theta_1)$ in equation 4.3 with the resulting expression. Again $g(\theta_1)$ was assumed to be a beta distribution, and the estimates of the parameters were found to be $\hat{r} = 5.9877$ and $\hat{s} = 4.5207$. The last two columns of Table 2 show the observed and expected frequencies of $y_2$, and the level of significance is greater than .1.

CONCLUDING REMARKS

Empirical investigations (Bliss, 1980; Cross & Frary, 1977) have shown that a random guessing model may be untenable, and it has been argued that such an assumption will frequently be unrealistic (e.g., Lord & Novick, 1968, p. 309). All indications are that guessing will be higher than random, and the strong true-score model described here is consistent with these results. Moreover, our common sense notion is that guessing should not be ignored, and in certain situations analytic results show that guessing can be a serious problem (van den Brink & Koele, 1980; Wilcox, 1980). Since all indications are that the assumptions about how examinees behave under answer-until-correct tests will frequently be consistent with observed test scores, perhaps it is now possible to deal with guessing in a more effective manner.

Another important point made by a referee is that investigators might want to collect pretest data under an AUC procedure even if the procedure is not to be used in operational versions of the test. Various possibilities are discussed elsewhere (Wilcox, 1981a, 1981c, in press a). These include estimating test item accuracy under conventional scoring procedures, and estimating the effectiveness of the distractors. If these values are judged to be too small, it might be possible to correct the problem by modifying or replacing some of the distractors.

Another situation where AUC tests might be useful involves the biserial correlation. When estimating this value, improved information about $\zeta$ might be useful (Ashler, 1979).

A third possible application is the empirical derivation of a formula score that corrects for guessing without assuming guessing is at random (Wilcox, 1982). Once certain parameters are estimated, this scoring formula can be used when the only available information is an examinee's observed number-correct score.

Finally, it is not being suggested that Wilcox's model be routinely applied. Instead, it is being argued that if the underlying assumptions seem reasonable, and if the observed test scores are consistent with these assumptions, then Wilcox's model might be considered when scoring and analyzing a test.



REFERENCES

Ashler, D. Biserial estimators in the presence of guessing. Journal of Educational Statistics, 1979, 4, 325-356.

Bliss, L. B. A test of Lord's assumption regarding examinee guessing behavior on multiple-choice tests using elementary school students. Journal of Educational Measurement, 1980, 17, 147-153.

Cross, L. H., & Frary, R. B. An empirical test of Lord's theoretical results regarding formula scoring of multiple-choice tests. Journal of Educational Measurement, 1977, 14, 313-321.

Duncan, G. T. An empirical Bayes approach to scoring multiple-choice tests in the misinformation model. Journal of the American Statistical Association, 1974, 69, 50-57.

IBM Application Program, System/360. Scientific subroutines package (360-DM-03X) Version III programmer's manual. White Plains, NY: IBM Corporation Technical Publications Department, 1971.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Morrison, D. G., & Brockway, G. A modified beta-binomial model with applications to multiple choice and taste tests. Psychometrika, 1979, 44, 427-442.

Robertson, T. Testing for and against an order restriction on multinomial parameters. Journal of the American Statistical Association, 1978, 73, 197-202.

van den Brink, W. P., & Koele, P. Item sampling, guessing and decision-making in achievement testing. British Journal of Mathematical and Statistical Psychology, 1980, 33, 104-108.

Wilcox, R. R. Determining the length of a criterion-referenced test. Applied Psychological Measurement, 1980, 4, 425-446.

Wilcox, R. R. Solving measurement problems with an answer-until-correct scoring procedure. Applied Psychological Measurement, 1981, 5, to appear. (a)

Wilcox, R. R. A review of the beta-binomial model and its extensions. Journal of Educational Statistics, 1981, 6, 3-32. (b)

Wilcox, R. R. A polarization test for making inferences about the entropy of multiple-choice test items. Unpublished technical report, Center for the Study of Evaluation, UCLA, 1981. (c)

Wilcox, R. R. Some empirical and theoretical results on an answer-until-correct scoring procedure. British Journal of Mathematical and Statistical Psychology, 1982, to appear.

Wilcox, R. R. Using results on k out of n system reliability to study and characterize tests. Educational and Psychological Measurement, in press. (a)

Wilcox, R. R. Determining the length of multiple-choice criterion-referenced tests when an answer-until-correct scoring procedure is used. Educational and Psychological Measurement, in press. (b)

Wilcox, R. R. A closed sequential procedure for answer-until-correct tests. Journal of Experimental Education, in press. (c)



TABLE 1

Number of Examinees Needing i (i=1,...,t) Attempts to Get the Correct Response

                       ATTEMPTS
ITEM        1      2      3      4      5

   7      164     41     39     94     48
  18      111     73     38
  19      228    140     14
  20      272     54     56
  21      220     89     72
  22      257     85     37
  23      308     47     22
  24      151    111     83
  25      121    130    119
  26      241     88     38
  27      235     79     50
  28      232     76     54
  29      101    121    140
  30       94    101    168

The entries for items 1-6 and 8-17 are illegible in the source.



TABLE 2

Observed and Expected Frequencies

            Observed y2   Expected y2   Expected y2 Frequencies   Observed      Expected y2
            Frequencies   Frequencies   Under Random              Frequencies   Frequencies
Value       t=5           t=5           Guessing                  t=3           t=3, zeta=.547

  0            23           24.51           68.16                    16            13.24
  1            64           70.77          115.82                    33            35.25
  2            94           99.73          101.73                    54            54.24
  3            82           90.04           60.09                    70            62.97
  4            72           57.65           26.29                    45            58.97
  5            31           27.54            8.91                    48            46.57
  6            12           10.09            2.39                    36            31.45
  7             5            2.87             .51                    17            18.28
  8             1             .63             .09                     8             9.07
  9                                                                   7             3.33
 10                                                                   2             1.34




CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education
University of California, Los Angeles 



ABSTRACT 



For a specific achievement test item and a randomly selected examinee,
let p be the probability of correctly determining whether the examinee knows
the correct response. Various techniques have been proposed for estimating
p. The purpose of this brief note is to describe and illustrate how results
in the engineering literature on "k out of n system reliability" can be
used to study and characterize tests based on the estimated values of p.
In particular, we can empirically determine the minimum number of distractors
required for multiple-choice tests. If we estimate p with an answer-until-
correct scoring procedure, we can also determine the minimum number of
examinees needed to be reasonably certain about whether $\gamma$ is less than or
greater than some predetermined constant, where $\gamma = \sum p_i$ and $p_i$ is the value
of p for the ith item on an n-item test. In other words, we can determine
whether the expected number of correct decisions on an n-item test is
reasonably large.



Suppose we have a multiple-choice achievement test item that represents
a particular skill. If an examinee chooses the correct response, we decide
he/she has acquired the skill. As indicated in Section 2 of this paper,
there are several methods for estimating the probability that, for a typical
examinee, we correctly decide whether the skill has been acquired. Usually,
however, these techniques have not been used to analyze tests that measure
n skills, and they have not been used to empirically determine how many
distractors we need for an item. The purpose of this paper is to illustrate
how results in the engineering literature on "system reliability"
can be used to help solve these problems. Section 3 reviews the results we
will need. Included is a slight extension of an existing theorem which, as
will be illustrated, is useful when addressing certain measurement problems.
Section 4 describes six examples of how these techniques might be applied.

2. Methods for Estimating Item Accuracy

Under normal testing procedures it is impossible to estimate the probability
of making an incorrect decision about whether an examinee has
acquired a skill. In particular, there is no estimate of the probability
of guessing the correct response when an examinee does not know, nor is there
an estimate of the probability of knowing and being incorrect because of
carelessness or a momentary distraction. However, there are circumstances
under which these probabilities can be estimated.

One approach is suggested by Wilcox (1980). Consider a multiple-choice
test item with t alternatives, one of which is correct. For a population
of examinees, let $\zeta$ be the proportion who know the correct response,
and let $\zeta_i$ (i = 0, 1, ..., t-2) be the proportion of examinees who do not know
but who can eliminate i distractors. Suppose an answer-until-correct scoring
procedure is used, which means that examinees choose alternatives until
the correct one is identified. If examinees who know are always correct
on the first choice, and if examinees who do not know guess at random from
among those distractors they cannot eliminate, then for a randomly selected
examinee, the probability of a correct response on the first alternative chosen is

$$\zeta + \sum_{j=0}^{t-2} \zeta_j/(t-j).$$

The probability of a correct response on the ith alternative chosen is

$$\tau_i = \sum_{j=0}^{t-i} \zeta_j/(t-j), \qquad i = 2, \ldots, t.$$

Suppose we decide that a testee knows the answer if the first alternative
chosen is correct. The probabilities of the four possible outcomes are
shown in Table 1.



TABLE 1

Four Possible Outcomes of a Randomly Selected Examinee
Responding to an Item

                                      Decision
                            Knows            Does Not Know

Latent    Knows             $\zeta$          $0$
State
          Does Not Know     $\tau_2$         $\sum_{i=2}^{t} \tau_i$



Thus, the probability of a correct decision for a randomly selected examinee
is

$$p = \zeta + \sum_{i=2}^{t} \tau_i.$$

It can be shown that for fixed $\zeta$, p attains its maximum value when guessing
is at random, i.e., $\zeta_0 = 1 - \zeta$.

For a random sample of N examinees, let $z_i$ be the number of times
a correct response is given on the ith try. Then $(z_1 - z_2)/N$ and $(N - z_2)/N$
are unbiased maximum likelihood estimates of $\zeta$ and p, respectively. Unbiased
maximum likelihood estimates of the $\zeta_i$'s are also readily obtained, as is
illustrated by Wilcox (1980) for the special case t=4.
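
The two estimates just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `auc_estimates` and the attempt counts are hypothetical, and the sketch assumes every examinee eventually reaches the correct alternative, so that N is the sum of the counts.

```python
from fractions import Fraction

def auc_estimates(z):
    """Estimate zeta and p from answer-until-correct counts.

    z[i - 1] is the number of examinees who were correct on attempt i.
    Assumes every examinee eventually answers correctly, so N = sum(z).
    """
    n_examinees = sum(z)
    zeta_hat = Fraction(z[0] - z[1], n_examinees)   # (z_1 - z_2) / N
    p_hat = Fraction(n_examinees - z[1], n_examinees)  # (N - z_2) / N
    return zeta_hat, p_hat

# Hypothetical counts for a t = 4 item administered to N = 100 examinees.
counts = [60, 20, 12, 8]
zeta_hat, p_hat = auc_estimates(counts)
print(float(zeta_hat), float(p_hat))
```

Exact fractions are used only so that the arithmetic can be checked by hand; floating point would serve equally well.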

Another way to estimate the accuracy of decisions about whether the
typical examinee has acquired a particular skill is with latent structure
models. Macready and Dayton (1977) illustrate this for the case of equivalent
items. Two items are defined to be equivalent if a randomly sampled
examinee knows both or neither one. In addition to including guessing, the
model used by Macready and Dayton allows for the event of an examinee
knowing and being incorrect.

There are four methods for checking the assumption of equivalent items
(Macready and Dayton, 1977; Hartke, 1978; Baker and Hubert, 1977; and Wilcox,
in press, a). If the assumption of equivalent items is contraindicated
by the data, we might still use a latent structure model, but one based on
less stringent assumptions. In particular, we might assume items are
hierarchically related, which contains the assumption of equivalent items as a
special case (Wilcox, in press, b). Dayton and Macready (1976) describe
a general approach to hierarchically related items.



3. Review of "Reliability" Theory

Suppose a test measures n skills. For a randomly selected examinee
let $x_i = 1$ if a correct decision is made about whether the ith skill has been
acquired; otherwise, $x_i = 0$. Also, let $p_i = E x_i$, where the expectation is
taken over the population of examinees. As noted in the previous section,
$p_i$ can be estimated under various circumstances. We define the k out of n
reliability of a test, $\rho_k$, to be the probability of making at least k correct
decisions for a randomly selected examinee. Symbolically, $\rho_k = \Pr(\sum x_i \ge k)$.

Some readers might object to defining test reliability in the manner
described above since it differs from the usual definition of reliability
in classical test theory. The reason we do so is that it is consistent
with the usual definition of system reliability that is applied to engineering
problems (e.g., Barlow and Proschan, 1975; Marshall and Olkin, 1979, p. 402).

The purpose of this section is to list some results about $\rho_k$. Except
for Theorem 6, these results are not new, but they are not typically applied
to measurement problems, and so we describe them here for the convenience
of the reader.

First we note that if $x_i$ is independent of $x_j$, $i \ne j$,

$$\rho_k = \sum_{x:\, S \ge k} \; \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i} \qquad (1)$$

where $x = (x_1, \ldots, x_n)$ and $S = \sum x_i$. For some cases, (1) is easily computed,
for example, when n is small or k=n, but frequently $\rho_k$ is difficult to
calculate. Another and perhaps more serious problem is that the $x_i$'s might
not be independent. In this case, determining $\rho_k$ is more difficult. For
these reasons, efforts have been made to find ways to approximate $\rho_k$, and
to determine its properties.
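
When the $x_i$'s are independent, the sum in (1) need not be enumerated over all $2^n$ vectors x. The following Python sketch (the function name `k_out_of_n_reliability` is ours, not the paper's) builds the distribution of $S = \sum x_i$ one item at a time and then adds up the upper tail. It is offered only as an illustration of (1) under the independence assumption.

```python
def k_out_of_n_reliability(p, k):
    """Pr(S >= k) for S = sum of independent Bernoulli(p_i) variables.

    p is a sequence of item-level probabilities of a correct decision.
    dist[s] holds Pr(S = s) for the items processed so far.
    """
    dist = [1.0]  # with no items, S = 0 with probability one
    for p_i in p:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p_i)   # incorrect decision on this item
            new[s + 1] += mass * p_i       # correct decision on this item
        dist = new
    return sum(dist[k:])

# Ten items, each with p_i = .9, and k = 8.
print(round(k_out_of_n_reliability([0.9] * 10, 8), 4))
```

The running time is proportional to $n^2$, so the computation remains practical for tests of realistic length.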



Theorem 1. $\rho_k$ is strictly increasing in each $p_i$. A proof is given
by Barlow and Proschan (1975, p. 22).

Theorem 2. If $\mathrm{cov}(x_i, x_j) \ge 0$, $i \ne j$, then

$$\prod_{i=1}^{n} p_i \le \rho_k \le 1 - \prod_{i=1}^{n} (1 - p_i).$$

This is a special case of a result given by Barlow and Proschan (1975, p. 34).

Theorem 3. If $\mathrm{cov}(x_i, x_j) \ge 0$,

$$\max_{x:\, S = k} \; \prod_{i=1}^{n} p_i^{x_i} \le \rho_k.$$

This follows from Theorem 3.9 in Barlow and Proschan (1975, p. 37).

Definition: For any vector a, let $a_{(1)} \ge a_{(2)} \ge \cdots \ge a_{(n)}$ be the elements
of a written in descending order. The vector a is said to be majorized by
the vector b (b majorizes a) if

$$\sum_{i=1}^{k} a_{(i)} \le \sum_{i=1}^{k} b_{(i)}, \qquad k = 1, \ldots, n-1,$$

and

$$\sum_{i=1}^{n} a_{(i)} = \sum_{i=1}^{n} b_{(i)}. \qquad (2)$$

Symbolically, b majorizes a is written $a <^{m} b$. If (2) is replaced by

$$\sum_{i=1}^{n} a_{(i)} \le \sum_{i=1}^{n} b_{(i)},$$

b weakly majorizes a, which we write $a <^{wm} b$.

Theorem 4. Suppose we have two test forms where $p_i$ is defined as
before and $p_i^*$ is the corresponding probability on the other test. Suppose
$x_i$ is independent of $x_j$, $i \ne j$, for both test forms. Let $r_i = -\log p_i$, and
$r_i^* = -\log p_i^*$, where log is the natural logarithm. If $r^* <^{m} r$, then

$$\rho_k(p) \ge \rho_k(p^*),$$

with equality holding when k=n. A proof is given by Pledger and Proschan
(1971).

A corollary of Theorem 4 is that

$$\rho_k(p) \ge \rho_k(p_G, \ldots, p_G) = \sum_{x=k}^{n} \binom{n}{x} p_G^{x} (1 - p_G)^{n-x}$$

for any p, where $p_G = \left(\prod_{i=1}^{n} p_i\right)^{1/n}$ is the geometric mean of the $p_i$'s.

Theorem 5. Pledger and Proschan (1971) also show that if $R_i = (1 - p_i)/p_i$
and $R_i^* = (1 - p_i^*)/p_i^*$, then $R^* <^{m} R$ implies that

$$\rho_k(p) \ge \rho_k(p^*).$$

We should remark that $r^* <^{m} r$ does not imply that $R^* <^{m} R$, nor is the
converse true.



Theorem 6. Let r, $r^*$, R and $R^*$ be defined as in Theorems 4 and 5.
Suppose $x_i$ is independent of $x_j$, $i \ne j$, for both test forms, that r does not
majorize $r^*$, but that for some c, $r >^{m} r' = r^* - c$, where $c_i \ge 0$ (i = 1, ..., n).
Then $\rho_k(p) \ge \rho_k(p^*)$ for any k. The same is true if $R >^{m} R' = R^* - c$ for some c,
$c \ge 0$.

Proof: Theorem 4 says that $r >^{m} r'$ implies that $\rho_k(p) \ge \rho_k(p')$. Also,
$r' = r^* - c$ means that $p_i' = p_i^* + b_i$ for some appropriately chosen $b_i \ge 0$,
where $p_i'$ is the p value corresponding to $r_i'$. Thus, by Theorem 1,
$\rho_k(p') \ge \rho_k(p^*)$. The proof is exactly the same for R and $R^*$.

Theorem 7. If $x_i$ is independent of $x_j$, $i \ne j$, Hoeffding (1956) shows that

$$\rho_k(p) \ge \sum_{j=k}^{n} \binom{n}{j} \bar{p}^{\,j} (1 - \bar{p})^{n-j}, \qquad k \le n\bar{p},$$

and

$$\rho_k(p) \le \sum_{j=k}^{n} \binom{n}{j} \bar{p}^{\,j} (1 - \bar{p})^{n-j}, \qquad k - 1 \ge n\bar{p},$$

where $\bar{p} = n^{-1} \sum p_i$. (See, also, Gleser, 1975.)

4. Applications

As previously indicated, the purpose of this paper is to illustrate
how the above theorems can be applied to certain measurement problems.
This we now do.



Example 1: Suppose multiple-choice test items are being used, and that
we want

$$\rho_k \ge P^* \qquad (3)$$

for some positive $P^* < 1$. What is the minimum number of distractors required?

To solve this problem, suppose guessing is at random, and that an
examinee behaves as assumed under the answer-until-correct scoring procedure
described in Section 2. As was pointed out, for fixed $\zeta$, p is maximized
when guessing is at random. Furthermore, the value of $p_i$ is an increasing
function of $\zeta_i$, the proportion of examinees who know the answer to the ith
item. Since for any i, $\zeta_i$ is unknown, first consider the value of $\zeta_i$
that minimizes $\rho_k$. This is $\zeta_i = 0$ (i = 1, ..., n). Suppose $x_i$ is independent
of $x_j$, $i \ne j$. If the same number of distractors is used for each item, then
$p_1 = p_2 = \cdots = p_n = p$, say, and

$$\rho_k = \sum_{x=k}^{n} \binom{n}{x} p^{x} (1 - p)^{n-x} \qquad (4)$$

Let $p_0$ be the value of p for which (4) equals $P^*$. Then the number of
required distractors is $t = (1 - p_0)^{-1}$ since, when $\zeta = 0$, $p = 1 - t^{-1}$. For example,
if $P^* = .93$, n=10, and k=8, then $p_0 \approx .9$ and t=10. Of course, in practice,
this is an extremely large number of distractors. However, $\zeta_i = 0$ (i = 1, ..., n)
is highly unlikely, and so in reality, a smaller number of distractors
would be needed when guessing is at random.
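
The value $p_0$ in Example 1 can be found numerically. The sketch below is an illustration under the example's assumptions (independent items, a common p, and $\zeta_i = 0$); the helper names `binomial_tail` and `required_p` are ours. It solves (4) for $p_0$ by bisection, which works because (4) is increasing in p, and then reports $(1 - p_0)^{-1}$.

```python
from math import comb  # requires Python 3.8 or later

def binomial_tail(n, k, p):
    """Right-hand side of (4): Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

def required_p(n, k, target, tol=1e-10):
    """Smallest p (to within tol) for which binomial_tail(n, k, p) >= target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_tail(n, k, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

p0 = required_p(n=10, k=8, target=0.93)
t_required = 1 / (1 - p0)  # from p = 1 - 1/t when zeta = 0
print(round(p0, 3), round(t_required, 1))
```

The result agrees with the rounded values quoted in the example ($p_0$ close to .9 and t close to 10). In practice t would be rounded up to an integer.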



To illustrate Theorem 3, consider the more general case where
$\mathrm{cov}(x_i, x_j) \ge 0$. If again $p_1 = \cdots = p_n = p$, to guarantee (3) we determine p such
that $p^k = P^*$. From Theorem 3 it follows that the required number of
distractors is $(1 - p)^{-1}$. If, for example, $P^* = .93$, n=10 and k=8, then $(.991)^8 = .93$,
and so t=111. To reiterate, this value of t is based on an unrealistic
value for the $\zeta_i$'s. More realistic situations are considered below.
Our goal here is to illustrate Theorem 3 in a simple manner.

Example 2: We consider the same situation as in example 1, but we
assume information about the $\zeta$'s is available. More specifically, suppose
the $\zeta$'s have been estimated to be $\hat\zeta_{(1)} = .5$, $\hat\zeta_{(2)} = .6$, $\hat\zeta_{(3)} = \hat\zeta_{(4)} = .75$,
$\hat\zeta_{(5)} = \hat\zeta_{(6)} = \hat\zeta_{(7)} = .85$, $\hat\zeta_{(8)} = .9$ and $\hat\zeta_{(9)} = \hat\zeta_{(10)} = .95$. To determine the minimum number
of distractors, again assume guessing is at random, that $\mathrm{cov}(x_i, x_j) \ge 0$, and k=8.
To simplify the illustration, suppose the same number of distractors is to
be used for each item. Since $p_i$ is an increasing function of $\zeta_i$, and since
when guessing is random $p_i = \zeta_i + (1 - \zeta_i)(1 - t^{-1})$, we have, by an application of
Theorem 3, that a lower bound to $\rho_8$ is

$$(.75 + .25(1 - t^{-1}))^2 \, (.85 + .15(1 - t^{-1}))^3 \, (.9 + .1(1 - t^{-1})) \, (.95 + .05(1 - t^{-1}))^2. \qquad (5)$$

Thus, we can guarantee $\rho_8 \ge P^*$ by finding the smallest t such that (5) is
greater than or equal to $P^*$. Table 2 gives the value of (5) for t=4(1)8.

TABLE 2

Values of (5) for t=4(1)8 and k=6, 7 and 8

                        t
  k        4       5       6       7       8

  8      .745     .79     .82     .85     .87
  7      .79      .83     .85     .88      --
  6      .85      .88     .90     .91      --



These results are more encouraging than those in example 1, but having more
than 4 or 5 equally attractive distractors promises to be difficult in
practice.

We note that the lower bound to $\rho_k$ in Theorem 3 can be very sensitive
to the value of k. Table 2 also gives the value of (5) for t=4(1)8 for both
k=7 and k=6.
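
The Theorem 3 bound is simple to evaluate, because the maximum over vectors x with S=k is attained by taking the k largest $p_i$'s. The Python sketch below recomputes the bound for the $\zeta$ estimates of example 2 under random guessing. The names `ZETA_HAT`, `item_accuracy`, and `theorem3_lower_bound` are ours, and the sketch is meant only as a check on the arithmetic.

```python
from math import prod  # requires Python 3.8 or later

# Estimated proportions of examinees who know each of the ten items (example 2).
ZETA_HAT = [0.5, 0.6, 0.75, 0.75, 0.85, 0.85, 0.85, 0.9, 0.95, 0.95]

def item_accuracy(zeta, t):
    """p_i under random guessing: zeta + (1 - zeta)(1 - 1/t)."""
    return zeta + (1 - zeta) * (1 - 1 / t)

def theorem3_lower_bound(p, k):
    """max over x with S = k of prod p_i^x_i, i.e. the product of the k largest p_i."""
    return prod(sorted(p, reverse=True)[:k])

for t in range(4, 9):
    p = [item_accuracy(z, t) for z in ZETA_HAT]
    bounds = [round(theorem3_lower_bound(p, k), 3) for k in (8, 7, 6)]
    print(t, bounds)
```

For k=8 the product is exactly expression (5). Lowering k drops the smallest remaining $p_i$ from the product, which is why the bound rises quickly as k decreases.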



Next suppose that $x_i$ and $x_j$ are independent, $i \ne j$. From the corollary
to Theorem 4, a lower bound to $\rho_8$ is

$$\sum_{x=8}^{10} \binom{10}{x} p_G^{x} (1 - p_G)^{10-x}. \qquad (6)$$

Since we are still assuming guessing is at random, $p_i = \zeta_i + (1 - \zeta_i)(1 - t^{-1})$ and
equation (6) is easily calculated for any t. The values of $p_G$ corresponding
to t=2, 3, 4 are, respectively, .894, .942 and .948. Substituting these
values in (6), it follows that for t=2, $\rho_8 \ge .915$, for t=3, $\rho_8 \ge .983$, and for
t=4, $\rho_8 \ge .987$.



If instead we apply Theorem 7, the values of $\bar{p}$ corresponding to t=2, 3
and 4 are .8975, .9317, and .9488, and the resulting lower bounds to $\rho_8$
are .925, .973, and .988, respectively. As is evident, the lower bound
to $\rho_8$ for the case t=2 is higher than it was using the corollary to Theorem
4, but for t=3 and 4, the lower bounds are about the same.
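
The Theorem 7 figures can be reproduced with a short computation. The sketch below is an illustration only, with names of our choosing (`ZETA_HAT`, `hoeffding_lower_bound`). It takes the $\zeta$ estimates of example 2, forms $p_i$ under random guessing, and evaluates the first inequality of Theorem 7 at the arithmetic mean $\bar{p}$.

```python
from math import comb  # requires Python 3.8 or later

# Estimated proportions of examinees who know each of the ten items (example 2).
ZETA_HAT = [0.5, 0.6, 0.75, 0.75, 0.85, 0.85, 0.85, 0.9, 0.95, 0.95]

def hoeffding_lower_bound(p, k):
    """Binomial tail at the mean of p; a lower bound on rho_k when k <= n * p_bar."""
    n = len(p)
    p_bar = sum(p) / n
    if k > n * p_bar:
        raise ValueError("the lower bound in Theorem 7 requires k <= n * p_bar")
    return sum(comb(n, j) * p_bar**j * (1 - p_bar) ** (n - j) for j in range(k, n + 1))

for t in (2, 3, 4):
    p = [z + (1 - z) * (1 - 1 / t) for z in ZETA_HAT]
    print(t, round(sum(p) / len(p), 4), round(hoeffding_lower_bound(p, 8), 3))
```

The explicit check on k guards against applying the lower bound outside the range in which Theorem 7 guarantees it.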

In contrast to the previous illustrations, test accuracy is very high
using a "normal" number of distractors. An interesting feature of the
illustration just given is that there seems to be little reason for using
t=4 distractors rather than t=3, since the increase in $\rho_8$ is minimal at
best. Note, however, that this was derived under the assumption of random
guessing. If examinees have partial information, the $p_i$ values will be
lower, which in turn will lower the value of $\rho_8$. As mentioned in section
2, an answer-until-correct scoring procedure can be used to check for
partial information, and to estimate the $p_i$'s.

Example 3: The situation is assumed to be the same as in example 2,
except that we want to allow for the possibility of having a different
number of distractors across items. Assuming $x_i$ is independent of $x_j$, $i \ne j$,
the simplest approach to guaranteeing $\rho_8 \ge P^*$ is to determine the smallest
t for each item such that $p_i \ge p_G$, where $p_G$ is the value of $p_G$ in the
corollary to Theorem 4 such that

$$\sum_{x=8}^{10} \binom{10}{x} p_G^{x} (1 - p_G)^{10-x} = P^*.$$

If, for example, $P^* = .95$, $p_G = .915$. It follows that for $\zeta$ = .5, .6, .75, .85,
.9 and .95, the corresponding values of t are 6, 5, 5, 3, 2, 2, respectively.
(We assume that a minimum of t=2 distractors are used.)

For t=3 in example 2, assuming $x_i$ is independent of $x_j$, and that
guessing is at random, p = (.834, .871, .92, .92, .952, .952, .968, .984, .984),
implying that $p_G = .942$ and so $\rho_8 \ge .983$. In example 3, using the indicated
t values, p is given by p' = (.917, .92, .95, .95, .952, .952, .952, .95, .975, .975),
$p_G = .959$, and so $\rho_8 \ge .993$. This suggests that the k out of n reliability
with the latter test form is higher than the first, but this has not been
established. The corollary to Theorem 4 gives a lower bound to $\rho_k$, but it
has not been shown that the lower bound indicates which test form is more
accurate. If it had been true that $p_i' \ge p_i$ (i = 1, ..., 10), the test form in
example 3 would be more accurate according to Theorem 1, but it is evident
that this is not the case. However, by applying Theorem 6, it can be shown
that the latter test form has a higher value for $\rho_8$.

Example 4: As mentioned in section 2, Macready and Dayton (1977)
examine a latent structure model that can be used to estimate $p_i$. Included
in their discussion is a solution to the following problem: When measuring
a particular skill, how many items are needed, and what passing score should we
use, so that the probability of making a correct decision about whether a
typical examinee has acquired a particular skill, i.e., the value of p, is
reasonably close to one?

As before, let $\zeta$ be the proportion of examinees who have acquired the
skill, and for a randomly selected examinee, let $\alpha$ = Pr(incorrect response |
examinee knows) and let $\beta$ = Pr(correct response | examinee does not know).
Suppose n equivalent items are to be used to measure a skill. Macready
and Dayton provide a table of n values and passing scores corresponding to
various values of $\zeta$, $\alpha$, and $\beta$. For example, if $\zeta$ = .6, and $\beta$ = .3, and
if we want, with probability at least .95, to correctly determine whether
a randomly selected examinee has acquired the skill being measured, Table I
in Macready and Dayton (1977) says we need to use n=4 items with a passing
score of 3.

Using the results in section 3, we can extend the technique proposed
by Macready and Dayton to tests that measure m skills. As a simple
illustration, suppose we have 4 skills, and the number of items (and passing
score) corresponding to these four skills are 4(3), 5(4), 4(3) and 7(5),
respectively. For the first skill, for example, we have four items, and
if an examinee gets at least 3 correct, we decide he/she has acquired the
skill. Further, suppose that the estimation procedures described by Macready
and Dayton are applied, and the four estimates of $\zeta$ are .4, .5, .6,
and .75; the corresponding estimates of $\alpha$ are .05, .1, .05, and .1; and
the estimates of $\beta$ are .2, .3, .4, and .4. The minimum probability of a
correct decision associated with the four skills can be read from Macready
and Dayton's Table I (assuming a loss ratio of one), and they are .95, .9,
.9, .95. Making the appropriate independence assumption, Theorem 7 says
that, for a randomly selected examinee, a lower bound to the probability
of making at least 3 correct decisions for the four skills is .97.
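
The figure of .97 can be checked by hand. The worked computation below applies the first inequality of Theorem 7 with n=4 and k=3, using the four probabilities quoted above. The intermediate decimals are our own rounding.

```latex
\bar{p} = \tfrac{1}{4}(.95 + .9 + .9 + .95) = .925, \qquad
k = 3 \le n\bar{p} = 3.7,
\\[4pt]
\rho_3 \;\ge\; \sum_{j=3}^{4} \binom{4}{j}\,\bar{p}^{\,j}(1-\bar{p})^{4-j}
      = 4(.925)^3(.075) + (.925)^4
      \approx .2374 + .7321 = .9695 \approx .97 .
```

Because k=3 does not exceed $n\bar{p}$ = 3.7, the condition attached to the lower bound in Theorem 7 is satisfied.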

Example 5: A proficiency test is designed to measure m skills. For
each skill, a decision is made about whether an examinee knows the correct
response. How many items per skill do we need so that for the m skills,
at least k correct decisions are made for the typical examinee? This
problem is similar to previous illustrations. It can be solved using the
results in section 3 in conjunction with the techniques described by Macready
and Dayton.



Example 6: Suppose every examinee behaves as described in section 2
under the answer-until-correct scoring procedure, and that $x_i$ is independent
of $x_j$, $i \ne j$. Let $\gamma = E\sum x_i = \sum p_i$ be the expected number of correct
decisions on an n-item test, i.e., the number of times we expect to correctly
determine whether a typical examinee knows the answer to an item. We
consider the problem of determining whether $\gamma$ is reasonably large, say
greater than or equal to $\gamma_0$, a known constant. For a random sample of N
examinees, let $w_{ij} = 0$ if the jth examinee is correct on the second attempt
of the ith item; otherwise $w_{ij} = 1$. From section 2, $E w_{ij} = p_i$, j = 1, ..., N,
and so we decide $\gamma \ge \gamma_0$ if the estimate $\hat\gamma \ge \gamma_0$; otherwise we decide $\gamma < \gamma_0$. How
large must N be so that we can be reasonably certain of making a correct
decision about whether $\gamma$ is greater than or less than $\gamma_0$?

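Note that the situation is similar to one considered by Fhaner (1974)
and Wilcox (1979). The main difference is that rather than a binomial
model, here we have a compound binomial distribution.

Following Fhaner (1974), suppose we want to choose the smallest N so
that when $\gamma \ge \gamma_0 + \delta^*$,

$$\Pr(\hat\gamma \ge \gamma_0) \ge T \qquad (6)$$

and when $\gamma \le \gamma_0 - \delta^*$,

$$\Pr(\hat\gamma < \gamma_0) \ge T \qquad (7)$$

where $\tfrac{1}{2} < T < 1$. Fhaner assumes $\delta^* > 0$, but we require $\delta^* \ge 1$ so that we can apply
Theorem 7. In particular, for $\gamma \ge \gamma_0 + \delta^*$, and for N=1,

$$\Pr(\hat\gamma \ge \gamma_0) \ge \sum_{x=\lceil \gamma_0 \rceil}^{n} \binom{n}{x} \left(\frac{\gamma_0 + \delta^*}{n}\right)^{x} \left(\frac{n - \gamma_0 - \delta^*}{n}\right)^{n-x} \qquad (8)$$

where $\lceil \gamma_0 \rceil$ is the smallest integer $\ge \gamma_0$. Again applying Theorem 7 (the
second inequality), we have that for $\gamma \le \gamma_0 - \delta^*$,

$$\Pr(\hat\gamma < \gamma_0) = 1 - \Pr(\hat\gamma \ge \gamma_0) \ge \sum_{x=0}^{\lceil \gamma_0 \rceil - 1} \binom{n}{x} \left(\frac{\gamma_0 - \delta^*}{n}\right)^{x} \left(\frac{n - \gamma_0 + \delta^*}{n}\right)^{n-x} \qquad (9)$$

Let $\eta_1$ and $\eta_2$ be the right-hand side values of (8) and (9), respectively.
From the above results, we have that for a random sample of N examinees,
when $\gamma \ge \gamma_0 + \delta^*$,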



N 



(10) 



and when $\gamma \le \gamma_0 - \delta^*$,



[ny 0 M 



Pr(?<T 0 )> Z (J)'^ (l-^) N " y 

y = 0 . y ' . ; ' 



(11) 



Thus, we can guarantee both (6) and (7), regardless of the actual value
of $\gamma$, by choosing the smallest N so that the right-hand sides of both (10)
and (11) are greater than or equal to T.

As a more specific example, suppose we have an n=10 item test, that
an answer-until-correct scoring procedure is to be used to estimate $\gamma$, and
we want to determine the minimum number of examinees we need in order to
correctly determine whether $\gamma$ is above or below $\gamma_0 = 7$. In particular,
suppose $\delta^* = 1$, and that if $\gamma \ge \gamma_0 + \delta^*$ we want $\Pr(\hat\gamma \ge \gamma_0) \ge T = .9$, and if $\gamma \le \gamma_0 - \delta^*$ we
want $\Pr(\hat\gamma < \gamma_0) \ge T$, regardless of the actual value of $\gamma$.

From (8) and (9), $\eta_1 = \sum_{x=7}^{10} \binom{10}{x} .8^{x} .2^{10-x} = .897$ and $\eta_2 = \sum_{x=0}^{6} \binom{10}{x} .6^{x} .4^{10-x} = .618$.
Substituting these values into (10) and (11), it can be verified that the
minimum N required is 67.



References 

Baker, F. B., & Hubert, L. J. Inference procedures for ordering theory.
    Journal of Educational Statistics, 1977, 2, 217-233.
Barlow, R. E., & Proschan, F. Statistical theory of reliability and life
    testing: Probability models. New York: Holt, Rinehart & Winston,
    1975.
Dayton, C. M., & Macready, G. B. A probabilistic model for validation of
    behavioral hierarchies. Psychometrika, 1976, 41, 189-204.
Fhaner, S. Item sampling and decision-making in achievement testing.
    British Journal of Mathematical and Statistical Psychology, 1974,
    27, 172-175.
Gleser, L. J. On the distribution of the number of successes in independent
    trials. Annals of Probability, 1975, 3, 182-188.
Hartke, A. R. The use of latent partition analysis to identify homogeneity
    of an item population. Journal of Educational Measurement, 1978, 15,
    43-47.
Hoeffding, W. On the distribution of the number of successes in independent
    trials. Annals of Mathematical Statistics, 1956, 27, 713-721.
Macready, G. B., & Dayton, C. M. The use of probabilistic models in the
    assessment of mastery. Journal of Educational Statistics, 1977, 2,
    99-120.
Marshall, A., & Olkin, I. Inequalities: Theory of majorization and
    its applications. New York: Academic Press, 1979.
Pledger, G., & Proschan, F. Comparisons of order statistics and of spacings
    from heterogeneous distributions. In J. S. Rustagi (Ed.), Optimizing
    methods in statistics. New York: Academic Press, 1971.
Wilcox, R. R. Solving measurement problems with an answer-until-correct
    scoring procedure. Center for the Study of Evaluation, University of
    California, Los Angeles, 1980.
Wilcox, R. R. Applying ranking and selection techniques to determine the
    length of a mastery test. Educational and Psychological Measurement,
    1979, 39, 13-22.
Wilcox, R. R. The single administration estimate of the proportion of
    agreement of a proficiency test scored with a latent structure model.
    Educational and Psychological Measurement, in press. (a)
Wilcox, R. R. Some results and comments on using latent structure models
    to measure achievement. Educational and Psychological Measurement,
    1980, in press. (b)







BOUNDS ON THE K OUT OF N RELIABILITY OF A TEST, AND AN
EXACT TEST FOR RANDOM GUESSING



Rand R. Wilcox

Department of Psychology
University of Southern California

and

The Center for the Study of Evaluation
University of California, Los Angeles



ABSTRACT 

Consider an n-item multiple-choice test where it is decided that
an examinee knows the answer if and only if he/she gives the correct
response. The k out of n reliability of the test, $\rho_k$, is defined to be
the probability that for a randomly sampled examinee, at least k
correct decisions are made about whether the examinee knows the answer to
an item. The paper describes and illustrates how an extension of a
recently proposed latent structure model can be used in conjunction
with results in Sathe et al. (1980) to estimate upper and lower bounds
on $\rho_k$. A method of empirically checking the model is discussed. Included
is an exact test of whether guessing is at random.



Consider a randomly sampled examinee responding to a multiple-choice
test item. In mental test theory there are, of course, many
procedures that might be used to analyze this item. One approach might
be as follows. Suppose a conventional scoring procedure is used where
it is decided that an examinee knows the correct response if the correct
alternative is chosen, and that otherwise the examinee does not know.
If it were possible to estimate the probability, $\tau$, of correctly determining
an examinee's latent state (whether he/she knows the correct
response) based on the above decision rule, this would give an indication
of how well the item is performing for the typical examinee. The obvious
problem is that under normal circumstances, there is no way of estimating
this probability unless additional assumptions are made. One approach
is to assume that examinees guess at random among the alternatives when
they do not know the answer. If this knowledge or random guessing model
holds, $\tau$ is easily estimated. However, empirical investigations (Bliss,
1980; Cross & Frary, 1977) suggest that this assumption will frequently
be violated, and some related empirical results (Wilcox, 1982, in press a)
indicate that such a model can be entirely unsatisfactory for other reasons
as well.

Another approach is to use a latent structure model, and many such
models have been proposed for measuring achievement (e.g., Brownless &
Keats, 1956; Marks & Noll, 1967; Knapp, 1977; Dayton & Macready, 1977,
1980; Macready & Dayton, 1977; Wilcox, 1977a, 1977b, 1981a; Bergan et al., 1...).
The choice of a model depends on what one is willing to assume in a
particular situation. These models make it possible to estimate errors
at the item level such as

$\beta$ = Pr(randomly selected examinee gives the correct response | examinee
    does not know)                                                      [1]

which in turn yields an estimate of $\tau$. An illustration is given in a
later section. (For a review of latent structure models vis-a-vis
criterion-referenced tests, see Macready and Dayton, 1981.) For some
recent general comments on using latent structure models to measure
achievement, see Molenaar (1981) and Wilcox (1981b).

Assume for the moment that for each item on an n-item test, an
estimate of $\tau$ can be made. Let $x_i = 1$ if a correct decision is made on
the ith item for a randomly selected examinee; otherwise $x_i = 0$. Then
$E(x_i) = \tau_i$ (i = 1, ..., n) is the probability of a correct decision on
the ith item, where the expectation is taken over the population of
examinees.

Within the framework just described, how should an n-item test be
characterized? An obvious approach is to use

$$\gamma = \sum_{i=1}^{n} \tau_i \qquad [2]$$

which is the expected number of correct decisions among the n items.

Knowing $\gamma$ might not be important for certain types of tests, but
surely it is important for some achievement tests. However, even if
$\gamma$ is known exactly, it would be helpful to have some additional related
information about $\sum x_i$. For instance, a test constructor would have a
better idea of how the test performs if $\mathrm{VAR}(\sum x_i)$ could be determined.
The problem is that $\mathrm{VAR}(\sum x_i)$ depends on $\mathrm{COV}(x_i, x_j)$, but this last quantity
is not known, and at present there is no way of estimating it. An
alternative approach is to use the k out of n reliability of the test
(Wilcox, in press b), which is given by



$$\rho_k = \Pr\left(\sum x_i \ge k\right). \qquad [3]$$

In other words, if the goal of a test is to determine which of n items
an examinee knows, and if a conventional scoring procedure is used, $\rho_k$
is the probability of making at least k correct decisions for the typical
examinee.

Suppose, for example, n = 10 and $\gamma$ is estimated to be 7. Thus, the
expected number of correct decisions is 7, but there is no information
about the likelihood that at least 7 correct decisions will be made.
If $\rho_k$ were known, a test constructor would have some additional and
useful information for judging the accuracy of the test. $\rho_k$ might also
be used as follows. Suppose it is desired to have $\rho_8 \ge .9$. If $\gamma$ is
estimated to be 9.1, this is encouraging, but it is not clear what
implications this has in terms of making at least 8 correct decisions
for the typical examinee.

It is not being suggested that determining $\rho_k$ is important for every
test that might be constructed, but certainly it is important in various
situations. For example, when measuring progress through an instructional
program, surely it is desirable to determine which of the skills represented
by the items on the test have or have not been acquired by an examinee.
An estimate of $\rho_k$ yields information about how well a test performs this
goal.

If $x_i$ is independent of $x_j$, $i \ne j$, an exact expression for $\rho_k$ is
available via the compound binomial distribution. Perhaps there are
situations where this independence might be assumed, but it is evident
that this independence will not always hold. If it can be assumed that
$\mathrm{COV}(x_i, x_j) \ge 0$, bounds on $\rho_k$ are available (Wilcox, in press b). Recently
Sathe, Pradhan, and Shah (1980) derived bounds on $\rho_k$ that make no
assumption about $\mathrm{COV}(x_i, x_j)$. The main point of this paper is that these
bounds can be estimated using an extension of an answer-until-correct
(AUC) scoring procedure proposed by Wilcox (1981a). The paper also
indicates how an exact test can be made of certain implications of the
new model. This procedure can also be used to make an exact test of
whether guessing is at random. (For an asymptotic test, see Weitzman,
1970.) Finally, the paper includes some comments on how a test might
be modified when $\rho_k$ is judged to be too small.

An Extension of an Answer-Until-Correct Scoring Procedure

As just indicated, an extension of results in Wilcox (1981a) is
needed in order to apply the bounds derived by Sathe et al. (1980).
First, however, it is helpful to briefly review the procedure and basic
assumptions in Wilcox (1981a).

Consider a specific test item having t alternatives from which to
choose, one of which is the correct response. Assume examinees respond
according to an AUC scoring procedure. This means that examinees
choose an alternative, and they are told immediately whether the correct
response has been identified. If they are incorrect another response
is chosen, and this process continues until they are successful. Special
forms are generally available for administering AUC tests, which make
these tests easy to use in the classroom.

Let $\zeta_{t-1}$ be the proportion of examinees who know the correct
response, and let $\zeta_i$ ($i = 0, \ldots, t-2$) be the proportion of examinees
who can eliminate $i$ distractors given that they do not know. Wilcox
(1981a) assumes that examinees eliminate as many distractors as they
can and then choose at random from among those that remain. If $p_i$
is the probability of choosing the correct response on the ith attempt,
then

$$p_i = \sum_{j=0}^{t-i} \frac{\zeta_j}{t-j}, \qquad i = 1, \ldots, t. \qquad [4]$$

Note that the model assumes that at least one effective distractor is
being used. Put another way, no distinction is made between examinees
who know the answer and examinees who can eliminate all of the distractors.
Assuming the model holds,

$$\zeta_{t-1} = p_1 - p_2 \qquad [5]$$

and

$$\tau = 1 - p_2. \qquad [6]$$

If in a random sample of N examinees, $y_i$ examinees are correct on their
ith attempt, $\hat{p}_i = y_i/N$ is an unbiased estimate of $p_i$ which yields an
estimate of $\zeta_{t-1}$ and $\tau$.
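
The single-item model in equations 4 through 6 can be checked numerically.
The following is a minimal sketch and not part of the original paper: the
function names are hypothetical, and the proportions and counts used with
it are illustrative values rather than data from the study.

```python
from fractions import Fraction

def attempt_probabilities(zeta):
    """Equation 4. zeta[j] is the proportion who can eliminate j distractors
    (zeta[t-1] is the proportion who know). Returns [p_1, ..., p_t]."""
    t = len(zeta)
    return [sum(Fraction(zeta[j]) / (t - j) for j in range(t - i + 1))
            for i in range(1, t + 1)]

def estimate_zeta_tau(counts):
    """counts[i-1] is the number of examinees correct on attempt i.
    Returns the estimates of zeta_{t-1} (equation 5) and tau (equation 6)."""
    n = sum(counts)
    p1, p2 = Fraction(counts[0], n), Fraction(counts[1], n)
    return p1 - p2, 1 - p2

# Illustrative (made-up) proportions for a three-alternative item.
p = attempt_probabilities([Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)])
```

With these illustrative proportions the attempt probabilities sum to one, and
p[0] - p[1] recovers the proportion who know, as equation 5 requires.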

Although empirical studies suggest that this model will frequently
be reasonable (Wilcox, 1982, in press a), there are instances where this
will not be the case. For example, some items might require a
misinformation model, and an appropriate modification of the AUC scoring
procedure has been proposed (Wilcox, in press a). Further comments on this
problem are made in a later section of the paper.

Consider any two items on an n-item test, say items i and j.
Applying results in Sathe et al. requires an estimate of
$\tau_{ij} = \Pr(x_i = 1, x_j = 1)$, i.e., the joint probability of making a
correct decision for both items i and j. The remainder of this section
outlines how this might be done.

It is assumed that an examinee's guessing rate is independent over
the items that he/she does not know. This means, for example, that if
an examinee can eliminate all but 2 alternatives on item i and all but
3 alternatives on item j, the probability of choosing the correct response
on the first attempt of both items is (1/2)(1/3) = 1/6.

For the two items under consideration, let $p_{km}$ ($k, m = 1, \ldots, t$)
be the probability that a randomly selected examinee chooses the correct
response on the kth attempt of the first item, and the correct response
on the mth attempt of the second. If $\zeta_{gh}$ is the proportion of examinees
who can eliminate g distractors from the first item and h distractors
from the second ($g, h = 0, \ldots, t-1$), then

$$p_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t-m} \frac{\zeta_{ij}}{(t-i)(t-j)}. \qquad [7]$$

The last expression can be used to express $\zeta_{t-1,t-1}$ in terms of the $p_{km}$'s,
which can be used to estimate $\zeta_{t-1,t-1}$. Note that if the first item has t'
alternatives, $t' \ne t$, simply replace t-k with t'-k in equation 7.

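To clarify matters, consider the special case t = 3. Equation 7
says that

$$p_{11} = \zeta_{22} + \zeta_{21}/2 + \zeta_{20}/3 + \zeta_{12}/2 + \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{02}/3 + \zeta_{01}/6 + \zeta_{00}/9 \qquad [8]$$

$$p_{12} = \zeta_{21}/2 + \zeta_{20}/3 + \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{01}/6 + \zeta_{00}/9 \qquad [9]$$

$$p_{13} = \zeta_{20}/3 + \zeta_{10}/6 + \zeta_{00}/9 \qquad [10]$$

$$p_{21} = \zeta_{12}/2 + \zeta_{02}/3 + \zeta_{11}/4 + \zeta_{01}/6 + \zeta_{10}/6 + \zeta_{00}/9 \qquad [11]$$

$$p_{22} = \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{01}/6 + \zeta_{00}/9 \qquad [12]$$

$$p_{23} = \zeta_{10}/6 + \zeta_{00}/9 \qquad [13]$$

$$p_{31} = \zeta_{02}/3 + \zeta_{01}/6 + \zeta_{00}/9 \qquad [14]$$

$$p_{32} = \zeta_{01}/6 + \zeta_{00}/9 \qquad [15]$$

$$p_{33} = \zeta_{00}/9. \qquad [16]$$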
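
Equation 7 and its special cases for t = 3 can be generated mechanically.
The sketch below is an illustration only, assuming both items have the same
number of alternatives; the function name and the proportions in the example
are hypothetical and are not taken from the paper.

```python
from fractions import Fraction

def joint_attempt_probabilities(zeta):
    """Equation 7 for two items with t alternatives each.
    zeta[i][j] is the proportion who can eliminate i distractors on the
    first item and j on the second (index t-1 means the item is known).
    Returns p with p[k-1][m-1] equal to p_km."""
    t = len(zeta)
    return [[sum(Fraction(zeta[i][j]) / ((t - i) * (t - j))
                 for i in range(t - k + 1) for j in range(t - m + 1))
             for m in range(1, t + 1)]
            for k in range(1, t + 1)]

# Illustrative (made-up) proportions for t = 3; they sum to one.
zeta = [[Fraction(3, 10), 0, 0],
        [0, Fraction(1, 5), 0],
        [0, 0, Fraction(1, 2)]]
p = joint_attempt_probabilities(zeta)
zeta_00 = 9 * p[2][2]                 # inverts p_33 = zeta_00 / 9
zeta_01 = 6 * (p[2][1] - p[2][2])     # inverts p_32 - p_33 = zeta_01 / 6
```

Working backward from the last cell in this way is the same route the text
takes when it solves for the remaining proportions one at a time.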



Thus, starting with equation 16

$$\zeta_{00} = 9p_{33} \qquad [17]$$

$$\zeta_{01} = 6(p_{32} - p_{33}) \qquad [18]$$

and eventually $\zeta_{22}$ can be expressed in terms of the $p_{km}$'s. Replacing
the $p_{km}$'s with their usual unbiased estimate yields an estimate of $\zeta_{22}$,
say $\hat{\zeta}_{22}$. But it can be seen that for the two items under consideration
(items i and j),

$$\tau_{ij} = \zeta_{22} + 1 - p_{11}. \qquad [19]$$

Replacing $\zeta_{22}$ and $p_{11}$ with $\hat{\zeta}_{22}$ and $\hat{p}_{11}$ yields an estimate of
$\tau_{ij} = \Pr(x_i = 1, x_j = 1)$, say $\hat{\tau}_{ij}$. For arbitrary t, $\hat{\tau}_{ij}$ is given by
equation 19 with $\zeta_{22}$ replaced with $\zeta_{t-1,t-1}$.

Bounds on $\rho_k$

This section illustrates how the results in the previous section
can be used to estimate bounds on $\rho_k$. First, however, results in Sathe
et al. (1980) are summarized.

Recall that $\mu = \sum \tau_i$ and let

$$S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \tau_{ij}, \qquad [20]$$

$$U_k = \mu - k, \qquad [21]$$

and

$$V_k = (2S - k(k-1))/2. \qquad [22]$$

Then,

$$\rho_k \ge \frac{[\text{numerator illegible in source}]}{n(n-k+1)}. \qquad [23]$$

If $2V_{k-1} \le (n+k-2)U_{k-1}$, then

$$\rho_k \ge \frac{2\left((k^*-1)U_{k-1} - V_{k-1}\right)}{(k^*-k)(k^*-k+1)} \qquad [24]$$

where $k^* + k - 3$ is the largest integer in $2V_{k-1}/U_{k-1}$. Two upper
bounds on $\rho_k$ are also given. The first is

$$\rho_k \le 1 + \frac{(n+k-1)U_k - 2V_k}{kn} \qquad [25]$$

and the second is that if $2V_k \le (k-1)U_k$,

$$\rho_k \le 1 - \frac{2\left((k^*-1)U_k - V_k\right)}{(k-k^*)(k-k^*+1)} \qquad [26]$$

where $k^* + k - 1$ is the largest integer in $2V_k/U_k$.

An Illustration

To illustrate how $\rho_k$ might be applied and interpreted, observations
of seven items were analyzed according to the procedure outlined
above. Each item had two distractors, and they were found to be
consistent with the assumptions of the answer-until-correct scoring
model (see Wilcox, 1981a). Table 1 shows the observed frequencies
for the first two items. The question to be answered is, if these
seven items are taken to be the whole test, do they give reasonably
accurate information about what the typical examinee knows?



Generally, when estimating $\zeta_{22}$ there is no need to estimate all
of the $\zeta$'s in equations 8-16. For the situation at hand, $\zeta_{22}$ can be
estimated as follows. First compute

$$\hat{\zeta}_{02}/3 = \hat{p}_{31} - \hat{p}_{32}. \qquad [27]$$

For the data in Table 1, this is .107. Next compute

$$\hat{\zeta}_{12}/2 = \hat{p}_{21} - \hat{p}_{22} - \hat{\zeta}_{02}/3, \qquad [28]$$

which is .074. Then

$$\hat{\zeta}_{22} = \hat{p}_{11} - \hat{p}_{12} - \hat{\zeta}_{12}/2 - \hat{\zeta}_{02}/3 \qquad [29]$$

which is equal to .225. Substituting these values into equation 19,
the estimate of $\tau_{12}$ is $\hat{\tau}_{12} = .75$. Applying equation 6 to all seven items,
it is seen that $\hat{\mu} = 5.434$. In other words, it is estimated that the
expected number of correct decisions is 5.434.
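
The arithmetic in equations 27 through 29 and equation 19 can be retraced
from Table 1. The sketch below is illustrative only; it assumes that the
rows of Table 1 index attempts on item 1 and the columns index attempts on
item 2, which is an inference from the flattened table rather than something
the source states outright.

```python
# Table 1 counts: rows = attempts on item 1, columns = attempts on item 2
# (this orientation is an assumption about the original layout).
counts = [[179, 26, 14],
          [76, 8, 4],
          [53, 13, 4]]
n = sum(map(sum, counts))                        # total number of examinees
p = [[c / n for c in row] for row in counts]     # p[k-1][m-1] estimates p_km

z02_over_3 = p[2][0] - p[2][1]                          # equation 27
z12_over_2 = p[1][0] - p[1][1] - z02_over_3             # equation 28
z22 = p[0][0] - p[0][1] - z12_over_2 - z02_over_3       # equation 29
tau_12 = z22 + 1 - p[0][0]                              # equation 19

print(round(z02_over_3, 3), round(z12_over_2, 3), round(z22, 3), round(tau_12, 3))
```

Under that reading of the table the four quantities come out at roughly
.106, .074, .225, and .751, which agrees with the values reported in the
text to within a unit in the last digit.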

Next consider $\rho_5$. The value of S was estimated to be 16.929.
From equations 20 - 26, this implies that

$$.418 < \rho_5 < .74. \qquad [30]$$
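
The upper limit of .74 can be retraced from the first upper bound
(equation 25). The sketch below is a hedged illustration, assuming the
readings $U_k = \mu - k$ and $V_k = (2S - k(k-1))/2$ for equations 21 and 22;
the function name is hypothetical, and the lower bound is not sketched here.

```python
def upper_bound_eq25(mu, S, k, n):
    """First upper bound on rho_k (equation 25), with
    U_k = mu - k and V_k = (2S - k(k-1))/2 as assumed above."""
    U = mu - k
    V = (2 * S - k * (k - 1)) / 2
    return 1 + ((n + k - 1) * U - 2 * V) / (k * n)

# Values reported in the illustration: seven items, k = 5.
bound = upper_bound_eq25(mu=5.434, S=16.929, k=5, n=7)
```

With the reported estimates the bound evaluates to about .74, matching the
upper limit quoted in the text.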

This analysis suggests that these seven items, taken as a whole,
are not very accurate since there is at least a 26 percent chance of
making an incorrect decision on three or more items. How should the
test be modified? Another important question is to what extent can
it be improved? One approach to improving the test is to increase the
number of distractors, and another approach is to try to modify or
replace the distractors that are being used. The latter approach will
be considered first.

The initial step in trying to decide whether to replace or modify
the existing distractors is to determine the extent to which they can
be improved. This can be done with the $\Delta$ measure in Wilcox (1981a, eq. 20).
This measure is just the difference between the maximum possible value
of $\tau$ and the estimated value given that $\zeta_0 = \hat{\zeta}_0$. Another related
measure is the entropy function (see Wilcox, 1981a). This measures
the effectiveness of the distractors among the examinees who do not know
the correct response by indicating the extent to which $p_2, \ldots, p_t$ are
unequal. The closer they are to being equal, the more effective are
the distractors, i.e., guessing is closer to being random. It has been
pointed out (Wilcox, 1981a) that $\Delta$ might be objectionable as a
measure of the extent to which $p_2, \ldots, p_t$ are equal, but for present
purposes it would seem to be of interest because increasing $\rho_k$ depends on
the extent to which $\tau$ can be increased for each item.

Referring to Wilcox (1981a), a little algebra shows that for the
case t = 3,

$$\Delta = (p_2 - p_3)/2. \qquad [30]$$

For item 1 in Table 1, $\hat{\Delta} = .024$, and for item 2 it is .034. ($\Delta$ is assumed
to be positive, and so if $\hat{p}_2 < \hat{p}_3$, $\Delta$ is estimated to be zero.)
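
The estimate of $\Delta$ for a three-alternative item is a one-line
computation. The sketch below is illustrative; the function name is
hypothetical, and the counts for item 1 are the row totals of Table 1 under
the assumption that its rows index attempts on item 1.

```python
def delta_t3(y1, y2, y3):
    """Estimate of Delta = (p_2 - p_3)/2 for t = 3, set to zero when the
    observed proportions give a negative value."""
    n = y1 + y2 + y3
    return max(0.0, (y2 - y3) / (2 * n))

# Item 1 marginal counts read from Table 1 (assumed row totals).
delta_item1 = delta_t3(219, 88, 70)
```

For item 1 this gives approximately .024, in line with the value quoted
above.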

If the number of alternatives for item 1 is increased to t = 5,
and if guessing is at random, then the value of $\tau$ would be .893 which
represents an increase of .126 over the value of $\tau$ using the existing
distractors. Thus, it would seem that one approach to improving
item 1 is to find two more distractors that are about as effective as the
two being used. Of course in practice, this might be very difficult
to do.

Checking Certain Implications of the Model, and an
Exact Test for Random Guessing

Suppose $y_1, \ldots, y_t$ have a multinomial distribution with cell
probabilities $p_1, \ldots, p_t$ where $\sum y_i = n$ and $\sum p_i = 1$. This section
describes an exact test of whether two or more of the $p_i$'s are equal.
In other words, the null hypothesis might be that $p_i = p_j$ for some $i \ne j$,
or that $p_i = p_j = p_k$, etc. An important special case is the null
hypothesis that

$$p_2 = p_3 = \cdots = p_t. \qquad [31]$$

When equation 31 holds for the AUC scoring model, guessing is at random,
and the distractors are performing at their maximum possible
effectiveness among the examinees who do not know (see Wilcox, 1981a).

The main motivation for including this exact test in the present
paper is that it is relevant when verifying certain implications of the
new model described in previous sections. Consider, for example, equations
8-16. They imply that various inequalities must hold, which include

$$p_{11} \ge p_{12} \ge p_{13} \ge p_{23} \ge p_{33}. \qquad [32]$$

An asymptotic test of equation 32 is already available (Robertson, 1978).
Suppose, however, the number of observations is moderate or small and
that, for example, $\hat{p}_{11} < \hat{p}_{12}$ or $\hat{p}_{13} < \hat{p}_{23} < \hat{p}_{33}$. Then to test the
assumption that $p_{11} \ge p_{12}$ requires a test of $p_{11} = p_{12}$. In the second
case, the null hypothesis would be $p_{13} = p_{23} = p_{33}$. Note, however, that
if $\hat{p}_{11} < \hat{p}_{12}$, $\hat{p}_{12} > \hat{p}_{13}$, and $\hat{p}_{23} < \hat{p}_{33}$, a test of $p_{11} = p_{12}$ and
$p_{23} = p_{33}$ is needed, but that $p_{12} = p_{13}$ would not be tested because
$\hat{p}_{12} > \hat{p}_{13}$ is already consistent with equation 32.

The proposed test is based on the exact distribution of

$$S = \sum y_i^2. \qquad [33]$$

An expression for the probability function of S was derived by Alam and
Mitra (1981), but unfortunately their result is incorrect. (Prof. Alam
has confirmed the error in a letter to the author.) A correction to the
Alam and Mitra paper is in preparation which will include a correct
expression for the probability function of S. To illustrate how this
distribution can be used to test the implications of the model described
in this paper, the distribution of S for k = 2 is given below.

Let a be the smallest integer greater than or equal to n/2, and let
b be the largest integer between n/2 and n such that $b^2 + (n - b)^2 \le s$
where s is an integer. If n is odd

$$\Pr(S \le s) = \sum_{y=a}^{b} \binom{n}{y} \left[ p_1^{y} (1 - p_1)^{n-y} + p_1^{n-y} (1 - p_1)^{y} \right]. \qquad [34]$$

If n is even, subtract $\binom{n}{n/2} p_1^{n/2} (1 - p_1)^{n/2}$ from the right-hand side of
equation 34.
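
Equation 34, together with the correction for even n, can be compared with a
direct enumeration of the binomial outcomes. The sketch below is an
illustration under that reading of the equation; both function names are
hypothetical, and the brute-force version is included only as a cross-check.

```python
from math import comb

def prob_S_at_most(s, n, p1):
    """Pr(S <= s) for k = 2 cells via equation 34, including the
    subtraction that applies when n is even."""
    a = (n + 1) // 2                      # smallest integer >= n/2
    candidates = [b for b in range(a, n + 1) if b * b + (n - b) ** 2 <= s]
    if not candidates:                    # s is below the smallest possible S
        return 0.0
    b = max(candidates)
    q1 = 1 - p1
    total = sum(comb(n, y) * (p1 ** y * q1 ** (n - y) + p1 ** (n - y) * q1 ** y)
                for y in range(a, b + 1))
    if n % 2 == 0:                        # the middle term was counted twice
        total -= comb(n, n // 2) * (p1 * q1) ** (n // 2)
    return total

def prob_S_at_most_brute(s, n, p1):
    """Direct enumeration over y_1 = 0, ..., n (cross-check only)."""
    return sum(comb(n, y) * p1 ** y * (1 - p1) ** (n - y)
               for y in range(n + 1) if y * y + (n - y) ** 2 <= s)
```

The two functions should agree for any n, s, and $p_1$; the closed form is
simply the binomial probability that $y_1$ lies between n - b and b.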

For k > 2 the exact distribution of S is given by a recursive
formula that will appear in a correction to the Alam and Mitra paper.
To illustrate the proposed test, it is useful to also note that for k = 3,
the joint distribution of $y_2$ and $y_3$ given $y_1$ is binomial with parameters
$n - y_1$, $p_2/(1 - p_1)$ and $p_3/(1 - p_1)$. Thus, from equation 34,

$$\Pr(y_2^2 + y_3^2 \le s \mid y_1) = \sum_{y=a}^{b} \binom{n-y_1}{y} \left[\frac{p_2}{1-p_1}\right]^{y} \left[\frac{p_3}{1-p_1}\right]^{n-y_1-y} + \sum_{y=n-y_1-b}^{n-y_1-a} \binom{n-y_1}{y} \left[\frac{p_2}{1-p_1}\right]^{y} \left[\frac{p_3}{1-p_1}\right]^{n-y_1-y} \qquad [35]$$

if $n - y_1$ is odd, and if $n - y_1$ is even, $\Pr(y_2^2 + y_3^2 \le s \mid y_1)$ can be determined
by evaluating the right-hand side of equation 35 and subtracting

$$\binom{n-y_1}{(n-y_1)/2} \left[\frac{p_2}{1-p_1}\right]^{(n-y_1)/2} \left[\frac{p_3}{1-p_1}\right]^{(n-y_1)/2} \qquad [36]$$

where $n - y_1$ replaces n in the definition of a and b.
To test the hypothesis

$$H_0: p_1 = p_2 = \cdots = p_k$$

compute $s = \sum y_i^2$ and then compute $\Pr(S \le s)$ under the assumption that $H_0$
is true. If this last quantity is small, say less than $\alpha$, reject $H_0$.
Note that from Marshall and Olkin (1979, p. 391) it follows immediately
that this hypothesis testing procedure is unbiased. (In other words,
as the actual vector of $p_i$ values moves "away" from $H_0$, the power of the
test increases.)

The procedure is illustrated by testing to see whether guessing is
at random on one of the items used above. The observed outcomes were
$y_1 = 303$, $y_2 = 46$, and $y_3 = 21$. If guessing is at random, then, as
previously indicated, $p_2 = p_3$. Since $p_1$ does not play a direct role in
the null hypothesis, the conditional distribution of $y_2$ and $y_3$ given $y_1$
is used. The null hypothesis is that $p_2/(1 - p_1) = p_3/(1 - p_1) = 1/2$.
Compute $s = 46^2 + 21^2 = 2116$.

$$\Pr(y_2^2 + y_3^2 \le 2116 \mid y_1 = 303) \qquad [37]$$

is given by equation 35. Referring to tables compiled by Pearson (1968),
equation 37 was evaluated to be .035 and so the null hypothesis would
be rejected at the .05 level.

Estimating $\tau_{ij}$ When There Is Misinformation

Among the 30 items analyzed by Wilcox (in press a), the observed
test scores suggest that two of the items do not conform well to the
AUC scoring model described in a previous section. Thus, the proposed
estimate of $\tau_{ij}$ is inappropriate. This section illustrates how this
problem might be solved when a misinformation model appears to be more
appropriate for some of the items on the test.

Consider a test item with t alternatives, and let $\zeta_t$ be the
proportion of examinees who eliminate the correct response from consideration
on their first attempt of the item. (An AUC scoring procedure is being
assumed.) Once the examinee realizes that he/she has misinformation
about the skill represented by the item, it is assumed that the examinee
chooses the correct response on the next attempt. This assumption is
made here because it seems to give a good approximation to how examinees
were behaving on the items used in Wilcox (in press a). It is also
assumed that if an examinee does not know and does not have misinformation,
then he/she guesses at random among the t alternatives. Finally, for
examinees with misinformation, assume that they believe the correct
response is one of c alternatives that are in actuality incorrect.
Thus, examinees with misinformation will require at least c + 1 attempts
before getting the item correct. As an illustration, consider t = 5
and c = 3. Then,

$$p_1 = \zeta_{t-1} + \zeta_{t+1}/5 \qquad [38]$$

$$p_2 = \zeta_{t+1}/5 \qquad [39]$$

$$p_3 = \zeta_{t+1}/5 \qquad [40]$$

$$p_4 = \zeta_t + \zeta_{t+1}/5 \qquad [41]$$

$$p_5 = \zeta_{t+1}/5 \qquad [42]$$

where $\zeta_{t+1}$ is the proportion of examinees who do not know and who do
not have misinformation.
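
The misinformation model for t = 5 and c = 3 can be written out and inverted
directly. The sketch below is illustrative only: the function names are
hypothetical, the three proportions are assumed to sum to one, and the values
in the example are made up rather than taken from the paper.

```python
from fractions import Fraction

def misinformation_attempt_probs(z_know, z_misinfo, z_guess):
    """Equations 38-42 (t = 5, c = 3): returns [p_1, ..., p_5]."""
    g = z_guess / 5                       # share of guessers per attempt
    return [z_know + g, g, g, z_misinfo + g, g]

def recover_proportions(p):
    """Invert equations 38-42 using p_1, p_2 and p_4."""
    return p[0] - p[1], p[3] - p[1], 5 * p[1]

# Illustrative (made-up) proportions.
p = misinformation_attempt_probs(Fraction(1, 2), Fraction(1, 5), Fraction(3, 10))
know, misinfo, guess = recover_proportions(p)
```

Replacing the $p_i$ with observed proportions in the second function gives one
way of estimating the three proportions; other choices of which attempts to
use are possible.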

This model gave a good fit to the observed scores in Wilcox (in
press a), but an even more general model is possible. In particular,
let $\gamma$ be the proportion of examinees who have misinformation and give
the correct response once they have eliminated c = 3 alternatives. Then

$$p_4 = \gamma\zeta_t + \zeta_{t+1}/5 \qquad [43]$$

$$p_5 = (1 - \gamma)\zeta_t + \zeta_{t+1}/5. \qquad [44]$$



Various modifications of the model are, of course, possible and
presumably this model (with some appropriately chosen c value) will give
a good fit to the observed test scores. For illustrative purposes,
equations 38 - 44 are assumed to hold. The point of this section is that
it is now possible to again estimate $\tau_{ij}$ where the misinformation model
is assumed to hold for one or both of the items in any item pair. Note
that for a single item where equations 38 - 44 hold,

$$\tau = \zeta_{t-1} + [\text{remaining terms illegible in source}]. \qquad [45]$$

To estimate $\tau_{ij}$, the joint probability of making a correct decision
on a pair of items where, say, the first item is represented by a
misinformation model, equation 7 must be rederived. Accordingly, let t'
be the number of alternatives on the first item, and t is the number of
alternatives on the second. The misinformation model assumes that on
the first attempt of the item, examinees belong to one of three mutually
exclusive categories, namely, they know the answer and choose it,
they have misinformation and eliminate the correct response, or they
do not know and guess at random. Thus, using previously established
notation, equation 8 becomes

$$p_{11} = \zeta_{42} + \zeta_{41}/2t' + \zeta_{40}/3t' + \zeta_{02}/t' + \zeta_{01}/2t' + \zeta_{00}/3t' \qquad [46]$$

where, in this illustration, t' = 5. There is no $\zeta_{1j}$, $\zeta_{2j}$, or $\zeta_{3j}$ term
($j = 0, 1, 2$) because the misinformation model assumes that if examinees
do not know, they cannot eliminate any of the distractors. More generally,

$$p_{11} = \zeta_{t'-1,t-1} + \sum_{j=0}^{t-2} \frac{\zeta_{t'-1,j}}{(t-j)t'} + \sum_{j=0}^{t-1} \frac{\zeta_{0j}}{(t-j)t'}. \qquad [47]$$

Also

$$p_{k1} = p_{11} - \zeta_{42} \qquad (k = 2, \ldots, t') \qquad [48]$$

$$p_{12} = \zeta_{41}/2t' + \zeta_{40}/3t' \qquad [49]$$

$$p_{1m} = \sum_{j=0}^{t-m} \frac{\zeta_{4j}}{(t-j)t'} \qquad (m = 2, \ldots, t). \qquad [50]$$

The remaining values can be determined in a similar manner. For the
two items being used here

$$p_{2m} = \sum_{j=0}^{t-m} \frac{\zeta_{0j}}{(t-j)t'} \qquad (m = 2, \ldots, t) \qquad [51]$$

and

$$p_{3m} = p_{2m}.$$

The expressions for $p_{4m}$ and $p_{5m}$ involve the proportion of examinees
who have misinformation on the first item. Let $\zeta_{t',j}$ be the
proportion of examinees who have misinformation about the first item
and can eliminate j distractors on the second ($j = 0, \ldots, t-1$).
Previous expressions for the $p_{km}$'s did not involve $\zeta_{t',j}$ because the
misinformation model being used assumes that examinees who have
misinformation will get the item correct on their fourth attempt.
Of course, as previously indicated, some modification of this model
(i.e., some alternative value for c) will probably be necessary when
studying a different item for which there is misinformation. The point
is that the $p_{km}$'s can be expressed in terms of the $\zeta_{ij}$'s.

The remaining equations needed for the present situation are

$$p_{41} = \zeta_{52} + \zeta_{51}/2 + \zeta_{50}/3 + \zeta_{02}/5 + \zeta_{01}/10 + \zeta_{00}/15 \qquad [52]$$

$$p_{42} = \zeta_{51}/2 + \zeta_{50}/3 + \zeta_{01}/10 + \zeta_{00}/15 \qquad [53]$$

$$p_{51} = \zeta_{02}/5 + \zeta_{01}/10 + \zeta_{00}/15 \qquad [54]$$

$$p_{52} = \zeta_{01}/10 + \zeta_{00}/15 \qquad [55]$$

$$p_{53} = \zeta_{00}/15. \qquad [56]$$

Thus, starting with equation 56, $\zeta_{00}$ can be estimated by replacing $p_{53}$
with its usual unbiased estimate, and the remaining $\zeta$'s can be estimated
in a similar fashion. This, in turn, yields an estimate of $\tau_{ij}$ and so
bounds on $\rho_k$ can again be estimated as was illustrated in a previous
section.



Discussion

One feature about $\rho_k$ that might be disturbing is that generally it
is an increasing function of the $\zeta_i$'s, the proportion of examinees who
know the ith item. Thus, one way to ensure that $\rho_k$ is close to one is
to use easy items. This approach certainly is not being recommended.
The view taken here is that the goal of the test is to determine which of
n specific skills an examinee has acquired. The idea is that the student,
or perhaps an entire group of students, can be given remedial work on
those skills they have failed to learn. If $\rho_k$ is small, and if it
appears that adding effective distractors is difficult to do, this
suggests that a conventional scoring procedure is inadequate, and that
it should probably be abandoned. The possible replacements include
using completion items, the AUC scoring procedure used here, or
one of the many latent structure models referred to at the beginning
of the paper. These models make it possible to determine whether $\zeta_i$
is small (e.g., Wilcox, in press). If it is small, perhaps all of the
examinees should be given additional instruction.

The results reported in this paper might also be useful when
empirically checking the assumptions of other latent structure models.
For example, Macready and Dayton (1977) and Wilcox (1977) propose models
where it is assumed that pairs of equivalent items are available. Two
items are defined to be equivalent if examinees either know both or neither
one. When equivalent items are available, the proportion of examinees
who know both can be estimated (assuming local independence). Macready
and Dayton checked their model with a chi-square goodness-of-fit test, but
this requires at least three items that are equivalent to one another.
(When there are only two items, there are no degrees of freedom left.)
For illustrative purposes, assume t = 3, and consider equations 8-16.
If two items are equivalent, then

$$\zeta_{21} = \zeta_{20} = \zeta_{12} = \zeta_{02} = 0 \qquad [57]$$

$$p_{12} = p_{21} = p_{22} \qquad [58]$$

$$p_{13} = p_{23} \qquad [59]$$

and

$$p_{31} = p_{32}, \qquad [60]$$

and an exact test of these equalities can be made using the procedure
described in an earlier section. If one of these items is assumed to
be hierarchically related to the other, again certain equalities must
hold among equations 8-16, and this can again be tested (cf. White and
Clark, 1973; Dayton and Macready, 1976).
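
The equalities implied by equivalence can be confirmed from equation 7 for
any set of proportions in which an examinee knows both items or neither.
The sketch below is illustrative only; the helper name and the proportions
are hypothetical and are not data from the paper.

```python
from fractions import Fraction

def p_km(zeta, k, m):
    """Equation 7: probability of success on attempt k of the first item
    and attempt m of the second, for two items with t alternatives each."""
    t = len(zeta)
    return sum(Fraction(zeta[i][j]) / ((t - i) * (t - j))
               for i in range(t - k + 1) for j in range(t - m + 1))

# Equivalent items, t = 3: index 2 (knows the item) occurs only as zeta_22,
# so zeta_21 = zeta_20 = zeta_12 = zeta_02 = 0. Values are made up.
zeta = [[Fraction(1, 10), Fraction(1, 10), 0],
        [Fraction(1, 5), Fraction(1, 10), 0],
        [0, 0, Fraction(1, 2)]]

checks = [p_km(zeta, 1, 2) == p_km(zeta, 2, 1) == p_km(zeta, 2, 2),
          p_km(zeta, 1, 3) == p_km(zeta, 2, 3),
          p_km(zeta, 3, 1) == p_km(zeta, 3, 2)]
```

Each entry of `checks` compares cells that equations 8-16 make identical once
the four proportions in equation 57 are set to zero.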



Table 1

Number of Examinees Requiring i Attempts on Item 1
and j Attempts on Item 2

                        Number of Attempts on Item 2
Number of Attempts
on Item 1                  1         2         3

        1                179        26        14
        2                 76         8         4
        3                 53        13         4



References

Alam, K., and Mitra, A. Polarization test for the multinomial
distribution. Journal of the American Statistical Association, 1981, 76,
107-109.

Allen, M.J., and Yen, W.M. Introduction to measurement theory. Belmont,
CA: Wadsworth, 1979.

Bergan, J.R., Cancelli, A.A., and Luiten, J.W. Mastery assessment with
latent class and quasi-independence models representing homogeneous
item domains. Journal of Educational Statistics, 1980, 5, 65-81.

Bliss, L.B. A test of Lord's assumption regarding examinee guessing
behavior on multiple-choice tests using elementary school students.
Journal of Educational Measurement, 1980, 17, 147-153.

Brownless, V.T., and Keats, J.A. A retest method of studying partial
knowledge and other factors influencing item response. Psychometrika,
1958, 23, 67-73.

Cross, L.H., and Frary, R.B. An empirical test of Lord's theoretical
results regarding formula-scoring of multiple-choice tests. Journal
of Educational Measurement, 1977, 14, 313-321.

Dayton, C.M., and Macready, G.B. A probabilistic model for validation
of behavioral hierarchies. Psychometrika, 1976, 41, 189-204.

Dayton, C.M., and Macready, G.B. A scaling model with response errors
and intrinsically unscalable respondents. Psychometrika, 1980, 45,
343-356.

Knapp, T.R. The reliability of a dichotomous test-item: A "correlationless"
approach. Journal of Educational Measurement, 1977, 14, 237-252.

Macready, G.B., and Dayton, C.M. The use of probabilistic models in
the assessment of mastery. Journal of Educational Statistics, 1977,
2, 99-120.



. ' ' . « * References' * .^.V . 

Alam, K.,"and Mitra, A. Polarization test for the multinomial distri- 
bution. Journal -of the American Statistical Association, 1981, 76, /J^ 
107-109. . • - ' • ' ' . * 

Allen,- M.J *, and Yen,' W'.M, Introduction to measurement theory. Belmont, 

CA: Wadswor£h, 1979. ^ . * • " „ 

feiergan, J.R., Cancelli, A.A.,.and Lul-ten, J.W. Mastery assessment with 
latent class and quasi -Independence models represeatlng homogeneous 
Item domains. Journal of Educational Statistics , 1980, 5_, 65-81. 
Bliss, L.B. A test" of Lord's assumption regarding examinee guessing 
. behavior on multiple-choice test? using, elementary school students. 
Journal <of Educational Measurement , 1980,'l_7,, 147-153." ' 
Brownless, V.T., and Keats, J.A. A retest method of studying partial' 

knowledge and, other factors Influencing Item response.- Psyehometrlka , 
1958, 23, 67-73. . '•* \ 

Cross, L.H., and Frary, R.B. An empirical test of Lord's' theoretical * - © 
results regarding formula-scoring of multiple-choice tests. r Journal 
of Educational Measurement, 1977, .14,^313-321 . , 
Dayton, CM. , -and Macready, 6.B. A probabilistic model for validation^ 
of behavioral, hierarchies. Psychometri ka , 1976, 4]_, 189-204. ' . 
Dayton, CM., and Macready, G.B. A scaling model with response errors 
and Intrinsically unsellable respondents.- Psychometri ka , 1980, 45, 
343-356. . 4 - 

"Knapp, T.R. The reliability of a dlchotomous test-item: A 'correlatlonles^ .* 

approach. Journal of Educational Measurement , 1977,; 14j 237-252. 
Macready, G.B. , and Dayton, CM. 5 * The use of probabilistic models 1n ~~ 
the assessment of mastery. Journal of Educational Statistics . 1977, 
- 2, 99-120., •• ' ' • . a 

* - : . - - / " ' ■ 137 * - 



. 21 



Macready, G.B., and Dayton, C.M. The nature and use of state mastery
models. Applied Psychological Measurement, 1980, 4, 493-516.

Marks, E., and Noll, G.A. Procedures and criteria for evaluating
reading and listening comprehension tests. Educational and
Psychological Measurement, 1967, 27, 335-348.

Marshall, A., and Olkin, I. Inequalities: Theory of majorization and its
applications. New York: Academic Press, 1979.

Molenaar, I. On Wilcox's latent structure model for guessing. British
Journal of Mathematical and Statistical Psychology, 1981, 34, in press.

Pearson, K. Tables of the incomplete beta function. Cambridge: University
Press, 1968.

Robertson, T. Testing for and against an order restriction on multinomial
parameters. Journal of the American Statistical Association, 1978,
73, 197-202.

Sathe, Y.S., Pradhan, M., and Shah, S.P. Inequalities for the probability
of the occurrence of at least m out of n events. Journal of Applied
Probability, 1980, 17, 1127-1132.

Weitzman, R.A. Ideal multiple-choice items. Journal of the American
Statistical Association, 1970, 65, 71-89.

White, R.T., and Clark, R.M. A test of inclusion which allows for errors
of measurement. Psychometrika, 1973, 38, 77-86.

Wilcox, R.R. New methods for studying stability. In C.W. Harris, A.
Pearlman, and R. Wilcox, Achievement test items: Methods of study.
CSE Monograph No. 6, Los Angeles: Center for the Study of Evaluation,
University of California, 1977. (a)

Wilcox, R.R. New methods for studying equivalence. In C.W. Harris, A.
Pearlman, and R. Wilcox, Achievement test items: Methods of study.
CSE Monograph No. 6, Los Angeles: Center for the Study of Evaluation,
University of California, 1977. (b)

Wilcox, R.R. Solving measurement problems with an answer-until-correct
scoring procedure. Applied Psychological Measurement, 1981, 5,
399-414. (a)

Wilcox, R.R. Recent advances in measuring achievement: A response to
Molenaar. British Journal of Mathematical and Statistical Psychology,
1981, in press. (b)

Wilcox, R.R. Some empirical and theoretical results on an
answer-until-correct scoring procedure. British Journal of Mathematical and
Statistical Psychology, 1982, in press.

Wilcox, R.R. Some new results on an answer-until-correct scoring procedure.
Journal of Educational Measurement, in press. (a)

Wilcox, R.R. Using results on k out of n system reliability to study
and characterize tests. Educational and Psychological Measurement,
in press. (b)

Wilcox, R.R. Determining the length of multiple-choice criterion-referenced
tests when an answer-until-correct scoring procedure is used.
Educational and Psychological Measurement, in press. (c)




DETERMINING THE LENGTH OF MULTIPLE CHOICE
CRITERION-REFERENCED TESTS WHEN AN
ANSWER-UNTIL-CORRECT SCORING PROCEDURE IS USED

Rand R. Wilcox

DEPARTMENT OF PSYCHOLOGY
University of Southern California
Los Angeles, California 90007

and the

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles 90024



ABSTRACT

When determining the length of a criterion-referenced test, results
in van den Brink and Koele (1980) and Wilcox (1980a, 1980b) indicate that
the problem of guessing might be more serious than may have been expected.
Recently, however, a new method of scoring tests was proposed that corrects
for guessing without assuming guessing is at random. Moreover, empirical
investigations suggest that the underlying assumptions of the new scoring
procedure will frequently hold. This paper indicates how test length
might be determined when the new scoring procedure is used. The results
indicate that test length might be substantially reduced when the new
scoring rule can be applied.



1. INTRODUCTION

Consider a single examinee and a domain of multiple choice test items.
Let $\tau$ be the proportion of items the examinee knows, and let p be the
examinee's percent correct true score. In criterion-referenced testing a
frequent goal is determining whether an examinee's true score is above or
below a known constant, say $\pi_0$. Usually the problem is formulated in terms
of p (e.g., Huynh, 1976; Wilcox, 1979), but recently attention has also
been given to the case where $\tau$ is the true score of interest (e.g.,
van den Brink and Koele, 1980; Wilcox, 1980a).

A basic problem with criterion-referenced tests is determining how
many items to include on the test. Existing solutions are summarized
by Wilcox (1980b). (See, also, Berk, 1980.) Although considerable progress
has been made, serious problems remain. The main difficulty can be
summarized briefly as follows: When the test length problem is formulated in
terms of p and a single examinee, the solution proposed by Fhaner (1974)
may result in a test that is not overly long. However, if the problem is
posed in terms of $\tau$, and if guessing is assumed to be at random, van den Brink
and Koele (1980) show that the test may have to be substantially longer to
guarantee the same level of test accuracy as is obtained when the problem
of guessing can be ignored. Wilcox (1980a) notes that the problem is much
worse than indicated by van den Brink and Koele. This is not surprising
because there is no particular reason to assume random guessing, and
empirical studies verify that such an assumption might be unreasonable (Bliss,
1980; Cross and Frary, 1977).



Wilcox (1980b) indicates that the problem of guessing might be
partially alleviated when latent structure models can be used to estimate $\tau$,
but there are clearly situations where such models are inappropriate
(cf. Molenaar, 1981; Wilcox, 1981a). The result is that if multiple choice
test items must be used, an unrealistically large number of items might be
necessary in order to be reasonably certain of correctly classifying an
examinee whose true score $\tau$ is close to the criterion score $\pi_0$.

This paper extends existing test length solutions to situations where
an answer-until-correct scoring procedure can be used. An advantage of the
new solution is that it corrects for guessing without assuming guessing
is at random. In addition, the new results represent a substantial
improvement over existing techniques when multiple-choice test items are
being used.

An Answer-Until-Correct Scoring Rule

Wilcox (1981b) proposed an estimate of $\tau$ based on an
answer-until-correct scoring procedure. This subsection briefly reviews the
assumptions and justification for using this scoring rule.

Consider a multiple-choice test item with t alternatives, one of which
is correct. An answer-until-correct test refers to situations where an
examinee chooses alternatives until the correct one is identified. This
is usually accomplished by having examinees erase a shield on an answer
sheet until the correct alternative is chosen.

For a specific examinee and a randomly chosen item, let $p_i$ be the
probability that the correct answer is chosen on the $i$th attempt (i.e., the
probability of $i$ erasures is $p_i$). Wilcox makes certain assumptions about
how an examinee behaves when attempting an item, and in terms of the $p_i$'s,
these assumptions imply that

$$p_1 \ge p_2 \ge \cdots \ge p_t \tag{1}$$

Empirical investigations made by Wilcox (1980c, 1981b) suggest that the
inequalities in equation 1 will frequently hold. For results on how to
characterize n-item tests, see Wilcox (in press). For a strong true score
model, see Wilcox (1980c).

If equation 1 is assumed, a maximum likelihood estimate of $\tau$ is
available via the pool-adjacent-violators algorithm (Wilcox, 1981b). Here,
however, there is no loss in simply using the unrestricted maximum
likelihood estimate, which is

$$\hat{\tau} = (x_1 - x_2)/n \tag{2}$$

where $x_i$ ($i=1,2$) is the number of items for which the examinee is correct
on the $i$th attempt (i.e., the number of times the examinee erases $i$
shields), and $n$ is the number of items on the test. The appeal of equation 2
is that it estimates $\tau$ without assuming guessing is at random, and as was
previously noted, there is some empirical evidence that it is justified.
It should be noted that $\hat{\tau}$ is theoretically justified because $\tau$
can be shown to be equal to $p_1 - p_2$ when the assumptions in Wilcox (1981a)
hold.
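
The scoring rule in equation 2 can be sketched in a few lines of Python. This is only an illustration: the function name and the representation of the responses (one integer per item giving the attempt on which the correct alternative was found) are assumptions, not part of the original report.

```python
from collections import Counter


def auc_true_score_estimate(attempts):
    """Unrestricted estimate (x1 - x2) / n from answer-until-correct data.

    attempts[j] is the attempt number (1, 2, ...) on which the examinee
    identified the correct alternative of item j (hypothetical encoding).
    """
    n = len(attempts)
    if n == 0:
        raise ValueError("at least one item is required")
    counts = Counter(attempts)
    x1 = counts[1]  # items answered correctly on the first attempt
    x2 = counts[2]  # items answered correctly on the second attempt
    return (x1 - x2) / n


# Ten items: seven first-attempt successes, two second-attempt successes.
example = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]
print(auc_true_score_estimate(example))
```

Note that items needing three or more attempts lower the estimate only through the divisor n; they do not enter the numerator.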

2. DETERMINING TEST LENGTH
This section extends the test length solutions of Fhaner (1974) and
Wilcox (1979) to the answer-until-correct scoring procedure outlined above.
As in Wilcox (1981b) it is assumed that $x_1$ and $x_2$ have a multinomial
distribution.




Consistent with previous test length solutions (Wilcox, 1980b), the
goal is to determine the smallest $n$ so that when $\tau \le \pi_0 - \delta^*$
or when $\tau \ge \pi_0 + \delta^*$ the probability of a correct decision
(PCD) is at least $P^*$, where $1/2 < P^* < 1$ and $\delta^* > 0$ are
predetermined constants. In this section the decision $\tau \ge \pi_0$
is made if $\hat{\tau} \ge \pi_0$; otherwise the reverse decision is reached.
By convention, either decision about $\tau$ is said to be correct when $\tau$
is in the open interval $(\pi_0 - \delta^*, \pi_0 + \delta^*)$. This open
interval is called the indifference zone.

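When $\tau \ge \pi_0 + \delta^*$, the rule for deciding whether $\tau$ is
above or below $\pi_0$ means that

$$\mathrm{PCD} = \sum_{A} \frac{n!}{x_1!\,x_2!\,(n-x_1-x_2)!}\; p_1^{x_1} p_2^{x_2} q^{\,n-x_1-x_2} \tag{3}$$

where $A = \{(x_1, x_2): x_1 - x_2 \ge n\pi_0\}$, $\sum x_i = n$,
$q = 1 - p_1 - p_2$ and where $x_i \ge 0$ ($i = 1, 2, 3$).
When $\tau \le \pi_0 - \delta^*$,

$$\mathrm{PCD} = \sum_{B} \frac{n!}{x_1!\,x_2!\,(n-x_1-x_2)!}\; p_1^{x_1} p_2^{x_2} q^{\,n-x_1-x_2} \tag{4}$$

where $B = \{(x_1, x_2): x_1 - x_2 < n\pi_0\}$.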
To guarantee

$$\mathrm{PCD} \ge P^* \tag{5}$$

when $\tau \ge \pi_0 + \delta^*$ or when $\tau \le \pi_0 - \delta^*$, we
consider, as is typically done, the worst possible case. That is, the value
of $\tau$ is determined that minimizes equations 3 and 4. Then the smallest
integer $n$ is found so that $\mathrm{PCD} \ge P^*$.
It follows that equation 5 is satisfied for any value of $\tau = p_1 - p_2$
not in the indifference zone.
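
The multinomial sums in equations 3 and 4 can be evaluated directly by enumeration. The following is a minimal sketch under the stated multinomial assumption; the function name, its arguments, and the small numerical tolerance at the cut-off are illustrative choices, not part of the original report.

```python
from math import comb


def pcd_fixed_length(n, p1, p2, pi0, tau_above=True):
    """Exact PCD for an n-item answer-until-correct test.

    (x1, x2, n - x1 - x2) is assumed multinomial with probabilities
    (p1, p2, q = 1 - p1 - p2). The decision "tau >= pi0" is made when
    x1 - x2 >= n * pi0. With tau_above=True the function returns the
    probability of that decision; otherwise it returns its complement.
    """
    q = max(1.0 - p1 - p2, 0.0)  # guard against tiny negative round-off
    prob_decide_above = 0.0
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            # 1e-9 tolerance protects the boundary case from float error
            if x1 - x2 >= n * pi0 - 1e-9:
                x3 = n - x1 - x2
                coef = comb(n, x1) * comb(n - x1, x2)  # multinomial coefficient
                prob_decide_above += coef * p1**x1 * p2**x2 * q**x3
    return prob_decide_above if tau_above else 1.0 - prob_decide_above
```

Searching for the smallest n that satisfies equation 5 would then amount to calling this function at the worst-case parameter values for increasing n.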






Since the conditional distribution of $x_1$ given $x_2$ is binomial, it can
be seen that for $\tau \ge \pi_0 + \delta^*$, the PCD is minimized when
$\tau = \pi_0 + \delta^*$, and when $\tau \le \pi_0 - \delta^*$ the minimum
occurs when $\tau = \pi_0 - \delta^*$. Consider, for example, the case
$\tau \ge \pi_0 + \delta^*$. The probability of $x_1$ given $x_2$ can be
written as

$$f(x_1 \mid x_2) = \binom{n-x_2}{x_1} \left(\frac{p_2+\tau}{1-p_2}\right)^{x_1} \left(\frac{1-2p_2-\tau}{1-p_2}\right)^{n-x_1-x_2}$$

Thus, the PCD is equal to

$$\sum_{x_2=0}^{n} \binom{n}{x_2} p_2^{x_2} (1-p_2)^{n-x_2} \sum_{x_1=[x_2+n\pi_0]}^{n-x_2} f(x_1 \mid x_2) \tag{6}$$



where $[x]$ means the smallest integer greater than or equal to $x$. The
term $f(x_1 \mid x_2)$ is the only one that depends on the parameter $\tau$.
Also, for each $x_2$ and fixed $p_2$, the second summation is an increasing
function of $\tau$ (see, e.g., Wilcox, 1979). Thus, the value of $\tau$ that
minimizes equation 6 with the restriction that $\tau \ge \pi_0 + \delta^*$ is
$\tau = \pi_0 + \delta^*$. The case $\tau \le \pi_0 - \delta^*$ is
handled in a similar fashion, and in particular, the minimum PCD occurs
when $\tau = \pi_0 - \delta^*$.

There remains the problem of determining the exact values of $p_1$ and
$p_2$ that minimize the PCD when $p_1 - p_2$ is equal to $\pi_0 - \delta^*$
or $\pi_0 + \delta^*$. An exact solution is not given, but it is possible to
further limit the possible values of $p_1$ and then to use numerical
techniques to solve the problem.

First suppose $\tau = \pi_0 - \delta^*$. Since $p_1 + p_2 + q = 1$,
$2p_1 + q = 1 + \pi_0 - \delta^*$. It follows that the largest possible value
for $p_1$ is $p_1' = (1 + \pi_0 - \delta^*)/2$, and because of the restriction
on $q$, the smallest possible value is

$$p_1'' = \frac{1 - \pi_0 + \delta^*}{t} + \pi_0 - \delta^*.$$

In practice the closed interval $[p_1'', p_1']$ will be relatively short. For
example, if $\pi_0 = .8$ and $\delta^* = .1$, $p_1' = .85$ and $p_1'' = .775$.
Since $p_2 = p_1 - \pi_0 + \delta^*$ and $q = 1 - p_1 - p_2$, the PCD can be
written as a function of $p_1$, and the value of $p_1$ that minimizes the PCD
can be determined.
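
As a rough illustration of how short the search interval for $p_1$ is, the sketch below computes its end points. It assumes the lower bound has the form $(1-\tau)/t + \tau$, which follows from equation 1 if $q \le (t-2)p_2$ and which reproduces the value .775 quoted in the text when $t = 4$; both the formula and the choice $t = 4$ are inferences made for this example, and the function name is hypothetical.

```python
def p1_bounds(pi0, delta, t, below=True):
    """End points of the interval of admissible p1 at the worst-case tau.

    below=True uses tau = pi0 - delta; below=False uses tau = pi0 + delta.
    The lower bound assumes q <= (t - 2) * p2 (an inference from equation 1).
    """
    tau = pi0 - delta if below else pi0 + delta
    upper = (1.0 + tau) / 2.0      # attained when q = 0
    lower = (1.0 - tau) / t + tau  # attained when q = (t - 2) * p2
    return lower, upper


lower, upper = p1_bounds(0.8, 0.1, t=4)
print(round(lower, 3), round(upper, 3))
```

A numerical minimization of the PCD would then only need to scan a fine grid of $p_1$ values between these two end points.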

A similar approach can be used for the case $\tau = \pi_0 + \delta^*$. The
lowest possible value for $p_1$ is

$$\frac{1 - \pi_0 - \delta^*}{t} + \pi_0 + \delta^*$$

and the largest possible value is $(1 + \pi_0 + \delta^*)/2$.

Although the value of $p_1$ that minimizes the PCD can be determined,
there will be instances where this will be inconvenient and possibly
expensive to do. However, it is possible to obtain a conservative choice
for $n$ by considering the case $p_1 = (1 + \pi_0 - \delta^*)/2$ and $q = 0$.
Then the PCD is equal to

$$\sum_{x=0}^{x_0-1} \binom{n}{x} p_1^{x} (1-p_1)^{n-x}$$

where $x_0$ is the smallest integer greater than or equal to $n(\pi_0+1)/2$.
This situation yields a conservative value for $n$ in the sense that, for
values of $\tau$ not in the indifference zone, $\hat{\tau}$ achieves its
maximum variance when $p_1 = (1 + \pi_0 - \delta^*)/2$ and $q = 0$.






For this particular value of $p_1$, and since $q = 0$, results in Fhaner (1974)
and Wilcox (1979a) can be applied. In particular, an approximate solution
for $n$ is

$$n = \frac{\lambda^2 (1+\pi_0)(1-\pi_0)}{(\delta^*)^2} \tag{14}$$

where $\lambda$ is the $P^*$ quantile of the standard normal distribution.

Suppose, for example, $P^* = .9$, $\delta^* = .1$ and $\pi_0 = .8$. To ensure
that the PCD is at least .9 for any $\tau$ not in the indifference zone,
equation 14 says that approximately $n = 59$ items are required. For
$P^* = .95$, $n = 97$.
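
The two test lengths quoted above can be checked numerically. The sketch below assumes equation 14 has the form $n = \lambda^2(1+\pi_0)(1-\pi_0)/(\delta^*)^2$, which is the reading that reproduces both 59 and 97; the function name is illustrative.

```python
from statistics import NormalDist


def approx_test_length(p_star, delta, pi0):
    """Normal-approximation test length for the conservative case q = 0."""
    lam = NormalDist().inv_cdf(p_star)  # P* quantile of the standard normal
    return lam**2 * (1.0 + pi0) * (1.0 - pi0) / delta**2


print(round(approx_test_length(0.90, 0.1, 0.8)))  # 59
print(round(approx_test_length(0.95, 0.1, 0.8)))  # 97
```

The unrounded values are roughly 59.1 and 97.4, so the figures in the text appear to be rounded to the nearest integer rather than rounded up.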

Wilcox (1980b) also considered the situation where $\delta^* = .1$ and
$P^* = .9$ but where the usual correction for guessing formula score was used.
It was found that varying the actual probability of guessing the correct
response had a substantial effect on the test length. In one instance the
required test length was found to be 159, and in another it was 281. As
indicated above, an answer-until-correct scoring procedure requires only
59 items without assuming guessing is at random. Thus, the results reported
here are considerably more encouraging than those reported by Wilcox (1980b).






REFERENCES

Berk, R. A. A consumer's guide to criterion-referenced test reliability.
Journal of Educational Measurement, 1980, 17, 323-349.

Bliss, L. B. A test of Lord's assumption regarding examinee guessing
behavior on multiple-choice tests using elementary school students.
Journal of Educational Measurement, 1980, 17, 147-153.

Cross, L. H., & Frary, R. B. An empirical test of Lord's theoretical
results regarding formula scoring of multiple-choice tests. Journal
of Educational Measurement, 1977, 14, 313-321.

Fhaner, S. Item sampling and decision making in achievement testing.
British Journal of Mathematical and Statistical Psychology, 1974, 27,
172-175.

Huynh, H. Statistical consideration of mastery scores. Psychometrika,
1976, 41, 65-78.

van den Brink, W. P., & Koele, P. Item sampling, guessing and
decision-making in achievement testing. British Journal of Mathematical
and Statistical Psychology, 1980, 33, 104-108.

Wilcox, R. R. Applying ranking and selection techniques to determine
the length of a mastery test. Educational and Psychological Measurement,
1979, 39, 13-22 (a).

Wilcox, R. R. An approach to measuring the achievement or proficiency of
an examinee. Applied Psychological Measurement, 1980, 4, 241-251 (a).

Wilcox, R. R. Determining the length of a criterion-referenced test.
Applied Psychological Measurement, 1980, 4, 425-446 (b).

Wilcox, R. R. Some empirical and theoretical results on an
answer-until-correct scoring procedure. British Journal of Mathematical
and Statistical Psychology, 1980, submitted for publication (c).

Wilcox, R. R. Recent advances in measuring achievement: A response to
Molenaar. British Journal of Mathematical and Statistical Psychology,
1981, in press (a).

Wilcox, R. R. Solving measurement problems with an answer-until-correct
scoring procedure. Applied Psychological Measurement, 1981, in press (b).



A CLOSED SEQUENTIAL PROCEDURE FOR
COMPARING THE BINOMIAL DISTRIBUTION
TO A STANDARD

Rand R. Wilcox

Dept. of Psychology
University of Southern California
&
Center for the Study of Evaluation
Graduate School of Education
University of California, Los Angeles



ABSTRACT

Fhaner (1974) proposed an approach to measuring achievement where
the binomial error model is assumed, and where the goal is to determine
whether an examinee's percent correct true score is above or below a
known constant. Wilcox (1980b), as well as van den Brink & Koele (1980),
point out that a substantially larger number of items might be required
when guessing is incorporated into Fhaner's solution. The purpose of
this brief note is to derive the exact sampling distribution of a
closed sequential procedure that solves the problem considered by Fhaner.
We then show that the probability of a correct decision under the new
procedure is exactly the same as it is when Fhaner's procedure is
applied. In addition, the number of observations under the closed
sequential procedure is always less than or equal to the number
required under the fixed sample size approach. In some cases, the
number of observations is considerably less.




In the context of mental test theory, Fhaner (1974) considered the
problem of comparing a binomial probability function to a standard or
known constant. More specifically, it was assumed that a random variable
$x$ has a density given by

$$\binom{N}{x} \theta^{x} (1-\theta)^{N-x} \tag{1}$$

and that we want to determine whether $\theta$ is above or below a known
constant $\theta_0$. (For a recent review of the binomial error model, see
Wilcox, 1981.) The goal in Fhaner's paper was to determine the minimum $N$
so that simultaneously,

$$\sum_{x=n}^{N} \binom{N}{x} \theta^{x} (1-\theta)^{N-x} \ge P^* \quad \text{whenever } \theta \ge \theta_0 + \delta^* \tag{2}$$

and

$$\sum_{x=0}^{n-1} \binom{N}{x} \theta^{x} (1-\theta)^{N-x} \ge P^* \quad \text{whenever } \theta \le \theta_0 - \delta^* \tag{3}$$

where $1/2 < P^* < 1$ and $\delta^* > 0$ are predetermined constants, and $n$
is an appropriately chosen passing score.

We note that in recent years, the problem considered by Fhaner has
generated considerable interest in mental test theory. Wilcox (1980a)
summarizes existing results.

Suppose we choose $n$ to be the smallest integer such that $n/N \ge \theta_0$.
An asymptotic solution to determining $N$ satisfying both equations (2)
and (3) is

$$N = \lambda^2 \theta_0 (1-\theta_0)/(\delta^*)^2 \tag{4}$$

where $\lambda$ is the $P^*$ quantile of the standard normal distribution
(Wilcox, 1979a). From (4) it is evident that $N$ becomes indefinitely large
as $\delta^*$ approaches zero.



When applying Fhaner's solution to achievement tests, it may be
necessary to choose $\delta^*$ small in order to take guessing into account
(van den Brink & Koele, 1980; Wilcox, 1980b). This, in turn, might
mean that a relatively large number of items will be required. One
approach to this problem is to apply a sequential procedure, but these
are optimal under circumstances that might not be met (e.g., Wetherill,
1966). Also, depending on the values of $\theta$ and $\theta_0$, it is
possible that the number of observations will be larger when a sequential
procedure is applied.

When a sequential procedure is used, it is common practice to avoid
taking an inordinately large number of observations by deciding in advance
the maximum number of trials that will be allowed. In this event,
however, the observed number of successes is not given by the negative
binomial distribution, as it ordinarily would be (e.g., Wetherill, 1966),
and so we do not know the exact probability of a correct decision about
whether $\theta$ has a value above or below $\theta_0$.

For the reasons given above, we consider a closed sequential
procedure for comparing $\theta$ to $\theta_0$. We suppose that $N$ and $n$
are determined in the manner already described. Here, however, observations
are assumed to be taken one at a time until there are $n$ successes or
$m = N - n + 1$ failures. Let $x$ be the number of successes and let $y$ be
the number of failures when sampling is terminated. Note that either $x = n$,
in which case the possible values of $y$ are $0, 1, \ldots, m-1$; or $y = m$
and the possible values of $x$ are $0, 1, \ldots, n-1$. Our decision rule is
that $\theta \ge \theta_0$ when $x = n$; otherwise we decide that
$\theta < \theta_0$. The purpose of this brief note is to show that the
probability of a correct decision under this closed sequential procedure is
exactly the same as it is under the fixed sample size solution proposed by
Fhaner (1974). We also note that the expected number of observations for the
closed sequential procedure might be substantially less than what would
otherwise be required. For related results, see Alling (1966), Armitage
(1957), Spicer (1962), Wald (1947), and Anderson & Friedman (1960).
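
The stopping and decision rule just described can be written as a short routine. This is a sketch only: the function name, the 0/1 encoding of item responses, and the returned labels are assumptions made for the illustration.

```python
def closed_sequential_decision(responses, N, n):
    """Apply the closed sequential rule to a stream of 0/1 item scores.

    Sampling stops at n successes (decide theta >= theta_0) or at
    m = N - n + 1 failures (decide theta < theta_0). Returns the decision
    and the number of observations actually used.
    """
    m = N - n + 1
    x = y = 0  # running counts of successes and failures
    for r in responses:
        if r:
            x += 1
        else:
            y += 1
        if x == n:
            return "above", x + y
        if y == m:
            return "below", x + y
    raise ValueError("responses ended before a stopping boundary was reached")


# With N = 5 and n = 3, sampling stops at 3 successes or 3 failures.
print(closed_sequential_decision([1, 0, 1, 1, 0], N=5, n=3))
```

Because n + m - 1 = N, one of the two boundaries is always reached within N observations.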

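The Joint Distribution of x and y

Let $x_i = 1$ or $0$, $i = 1, 2, \ldots$, be a sequence of independent trials
where $\Pr(x_i = 1) = \theta$. The exact distribution of $x$ and $y$ can be
derived as follows: If $x = n$, then by the multiplication rule of
probabilities, $f(x, y \mid \theta)$, the joint probability of $x$ and $y$,
is given by

$$f(x, y \mid \theta) = \binom{n-1+y}{n-1} \theta^{n-1} (1-\theta)^{y}\, \theta \quad \text{for } x = n,\; y = 0, \ldots, m-1. \tag{5}$$

In a similar fashion

$$f(x, y \mid \theta) = \binom{m-1+x}{m-1} (1-\theta)^{m} \theta^{x} \quad \text{for } y = m,\; x = 0, \ldots, n-1. \tag{6}$$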






The Relationship Between the Closed Sequential Procedure
and Fhaner's Fixed Sample Size Solution

Let

$$h(\theta) = \begin{cases} \displaystyle\sum_{x=0}^{n-1} \binom{N}{x} \theta^{x} (1-\theta)^{N-x}, & \theta < \theta_0 \\[2ex] \displaystyle\sum_{x=n}^{N} \binom{N}{x} \theta^{x} (1-\theta)^{N-x}, & \theta \ge \theta_0. \end{cases}$$

In other words, for fixed $N$, $n$ and any $\theta$, $h(\theta)$ is the
probability of correctly determining whether the value of $\theta$ is above
or below $\theta_0$. We show that the probability of a correct decision under
the closed sequential procedure is given exactly by $h(\theta)$. That is, the
accuracy of both procedures is the same, regardless of the value of $\theta$.

Suppose the closed sequential procedure is applied and that
$\theta < \theta_0$. Then the probability of a correct decision is

$$\Pr(y = m \mid \theta) = \sum_{x=0}^{n-1} \binom{m-1+x}{x} \theta^{x} (1-\theta)^{m}.$$

From Patil (1960), this is equal to

$$\sum_{x=0}^{n-1} \frac{(m+n-1)!}{x!\,(m+n-x-1)!}\, \theta^{x} (1-\theta)^{n+m-1-x} = \sum_{x=0}^{n-1} \frac{N!}{x!\,(N-x)!}\, \theta^{x} (1-\theta)^{N-x} = h(\theta).$$

For similar reasons, $\Pr(x = n \mid \theta) = h(\theta)$ when
$\theta \ge \theta_0$. This completes the proof.
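
The equality of the two probabilities can also be checked numerically. The sketch below compares the negative binomial sums for the closed sequential procedure with the binomial sums defining $h(\theta)$; the function names are illustrative and the check is not a substitute for the argument via Patil (1960).

```python
from math import comb


def pcd_closed_sequential(theta, N, n, above):
    """Probability of stopping on the boundary matching the true state."""
    m = N - n + 1
    if above:  # Pr(x = n): n-th success arrives before the m-th failure
        return sum(comb(n - 1 + y, y) * theta**n * (1 - theta)**y
                   for y in range(m))
    # Pr(y = m): m-th failure arrives before the n-th success
    return sum(comb(m - 1 + x, x) * theta**x * (1 - theta)**m
               for x in range(n))


def h_fixed_sample(theta, N, n, above):
    """Fhaner's fixed sample size probability of a correct decision."""
    xs = range(n, N + 1) if above else range(n)
    return sum(comb(N, x) * theta**x * (1 - theta)**(N - x) for x in xs)


for theta, above in [(0.70, False), (0.75, False), (0.85, True), (0.90, True)]:
    a = pcd_closed_sequential(theta, 100, 80, above)
    b = h_fixed_sample(theta, 100, 80, above)
    print(theta, abs(a - b) < 1e-9)
```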

Next we note that the number of observations under the closed sequential
procedure is at most $N$, and on the average it is less. How much less
will, of course, depend on $\theta$ and $\theta_0$. In some cases, the amount
can be substantial.

Suppose, for example, $N = 100$ and $\theta_0 = .8$, in which case we set
$n = 80$ and $m = 21$. The number of observations under the sequential
procedure ranges from 21 to 100. Following Fhaner (1974), suppose an
indifference zone formulation of the problem is used with $\delta^* = .05$.
From equation (4), an approximate lower bound to the probability of a correct
decision is .894 when the fixed sample size procedure is used. The results
given above indicate that the same is true when the closed sequential
procedure is applied.

Figure 1 shows a plot of $E(x+y)$, the expected number of observations
using the closed sequential procedure. As is evident, for certain values
of $\theta$, $E(x+y)$ is considerably less than 100. As already noted, because
of guessing, even smaller values of $\delta^*$ might be deemed appropriate,
which will increase the required value for $N$. Thus, the closed sequential
procedure might be an important and valuable tool in many situations. Figure
2 shows a plot of $E(x+y)$ when $\theta_0 = .5$, $n = 50$ and $m = 51$.
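
The expected number of observations plotted in the figures can be computed from the joint distribution in equations (5) and (6). The following sketch does this by direct summation; the function name is an assumption, and the values it prints are not taken from the original figures.

```python
from math import comb


def expected_observations(theta, N, n):
    """E(x + y) under the closed sequential procedure."""
    m = N - n + 1
    # Sampling stopped by the n-th success after y failures (equation 5).
    stop_on_success = sum((n + y) * comb(n - 1 + y, y) * theta**n * (1 - theta)**y
                          for y in range(m))
    # Sampling stopped by the m-th failure after x successes (equation 6).
    stop_on_failure = sum((m + x) * comb(m - 1 + x, x) * theta**x * (1 - theta)**m
                          for x in range(n))
    return stop_on_success + stop_on_failure


for theta in (0.3, 0.5, 0.8, 0.95):
    print(theta, round(expected_observations(theta, 100, 80), 1))
```

At the extremes the result behaves as expected: an examinee who never succeeds is stopped after m = 21 items, and one who never fails is stopped after n = 80.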




Concluding Remarks

The new procedure might require the same number of observations as
Fhaner's, but this will be highly unlikely, particularly when $N$ is large.
On the average, the number of observations will be smaller, and in some
cases, by a substantial amount. Thus, it might be possible to reduce the
difficulties pointed out by Wilcox (1980b) and van den Brink & Koele
(1980). Of course, at least $N$ items must be available, and any sequential
procedure would seem to be inconvenient in certain situations. However, with
the current interest in computerized testing, the results reported here
might be useful.

We also note that for a population of examinees, our closed sequential
procedure is easily extended to the empirical Bayes framework considered
by Wilcox (1977, 1979b). In particular, suppose the probability
function of every examinee's observed score is given by equations (5) and
(6). It is readily verified that

$$\hat{\theta} = \frac{x}{x+y}$$

is a maximum likelihood estimate of $\theta$. Therefore, $\hat{\theta}^2$ is a
maximum likelihood estimate of $\theta^2$ (Zehna, 1966). Let $\hat{\theta}_i$
be the maximum likelihood estimate for the $i$th randomly sampled examinee,
$i = 1, \ldots, m$. It follows that $M_1 = m^{-1} \sum \hat{\theta}_i$ and
$M_2 = m^{-1} \sum \hat{\theta}_i^2$ can be used to estimate the first and
second moments of the distribution of $\theta$ over the population of
examinees.

If we assume the density of $\theta$ belongs to the beta family, we can
also estimate test accuracy as is done by Wilcox (1977), and we can
estimate test reliability in the manner described by Huynh (1976), by noting
that a negative binomial density function compounded by a beta distribution
yields the inverse Polya-Eggenberger probability function (e.g., Sibuya,
1979). The details are straightforward, and so further comments are not
made.








References

Anderson, T. W., & Friedman, H. (1960) A limitation of the property of
the sequential probability ratio test. In I. Olkin (Ed.), Contributions
to Probability and Statistics. Stanford: Stanford University Press.

Alling, D. W. (1966) Closed sequential tests for binomial probabilities.
Biometrika, 53, 73-84.

Fhaner, S. (1974) Item sampling and decision making in achievement
testing. British Journal of Mathematical and Statistical Psychology,
27, 172-175.

Huynh, H. (1976) On the reliability of decisions in domain-referenced
testing. Journal of Educational Measurement, 13, 253-264.

Patil, G. P. (1960) On the evaluation of the negative binomial
distribution with examples. Technometrics, 2, 501-505.

Sibuya, M. (1979) Generalized hypergeometric, digamma, and trigamma
distributions. Annals of the Institute of Statistical Mathematics,
31, 373-390.

Spicer, C. (1962) Some new closed sequential designs for clinical trials.
Biometrics, 18, 203-211.

Van den Brink, W. P., & Koele, P. (1980) Item sampling, guessing and
decision-making in achievement testing. British Journal of
Mathematical and Statistical Psychology, 33, 104-108.

Wald, A. (1947) Sequential analysis. New York: John Wiley.

Wetherill, G. B. (1966) Sequential methods in statistics. London:
Halsted Press.

Wilcox, R. R. (1977) Estimating the likelihood of a false-positive and
false-negative decision with a mastery test: An empirical Bayes
approach. Journal of Educational Statistics, 2, 289-307.

Wilcox, R. R. (1979) Applying ranking and selection techniques to
determine the length of a mastery test. Educational and Psychological
Measurement, 39, 13-22. (a)

Wilcox, R. R. (1979) On false-positive and false-negative decisions
with a mastery test. Journal of Educational Statistics, 4, 59-73. (b)

Wilcox, R. R. (1980) Determining the length of a criterion-referenced
test. Applied Psychological Measurement, 4, 425-446. (a)

Wilcox, R. R. (1980) An approach to measuring the achievement or
proficiency of an examinee. Applied Psychological Measurement, 4,
241-251. (b)

Wilcox, R. R. (1981) A review of the beta-binomial model and its
extensions. Journal of Educational Statistics, 3-32.

Zehna, P. W. (1966) Invariance of maximum likelihood estimation. Annals
of Mathematical Statistics, 37, 744.




ABSTRACT

Wilcox (1981a) proposed a latent structure model for answer-until-correct
tests that can solve various measurement problems, including correcting
for guessing without assuming guessing is at random. This paper proposes
a closed sequential procedure for estimating true score that can be
used in conjunction with an answer-until-correct test. For
criterion-referenced tests where the goal is to determine whether an
examinee's true score is above or below a known constant, the accuracy of
the new procedure is exactly the same as a more conventional sequential
solution. The advantage of the new procedure is that it eliminates the
possibility of using an inordinately large number of items when in fact a
large number of items is not needed; typical sequential procedures always
allow this possibility. In addition, the new procedure appears to compare
favorably to traditional tests where the number of items to be administered
is fixed in advance.




1. INTRODUCTION

Consider a multiple-choice test item with $t$ alternatives, one of
which corresponds to the correct response. Under an answer-until-correct
(AUC) scoring procedure, an examinee chooses alternatives until the correct
response is selected. In the past, this has been accomplished by having
the examinee erase a shield on an answer sheet; the examinee knows
immediately whether the correct response was chosen. If it was not, the
examinee erases another shield, and this process continues until the
correct alternative is chosen. Another way of administering AUC tests is with
a recently developed pen that is used in conjunction with a specially
treated answer sheet. The examinee marks his/her selection, which causes
a previously invisible mark to appear on the answer sheet. If the mark
signifies an incorrect choice, another alternative is chosen. An optical
scanner can then be used to count the number of attempts an examinee took
on each item of the test, or, of course, the test can be scored by hand.
A third way of administering AUC tests, and the one that is particularly
relevant to this paper, is by computer.

AUC tests appear to have several advantages. Past investigations
suggest they enhance learning (Pressey, 1950), increase reliability (Hanna,
1975; Gilman & Ferry, 1972), and under certain assumptions, they can be
used to correct for guessing without assuming guessing is at random (Wilcox,
1981a). Some implications of the assumptions made by Wilcox (1981a) have
been empirically investigated, and the results suggest they are frequently
reasonable (Wilcox, in press, a).






The ability to measure and correct the effects of guessing is
particularly important in criterion-referenced testing where the goal is to
determine whether an examinee's true score is above or below a known
constant (van den Brink & Koele, 1980; Wilcox, 1980). Because of guessing,
an unrealistically large number of items might be required to ensure a
reasonably accurate test.

The goal in this paper is to describe a closed sequential testing
procedure that might be used in conjunction with Wilcox's correction for
guessing formula score. The results reported here generalize those
reported in Wilcox (in press, b). To help motivate the new procedure, a
traditional sequential procedure is also discussed.

While the potential advantages of sequential procedures are known
(e.g., Wetherill, 1975), they have the practical disadvantage of possibly
requiring an even larger number of observations than would be used under
a fixed sample size approach. On the average this may not happen, but there
is a positive probability that a sequential procedure will need more
observations. Usually this problem is avoided by deciding in advance the
maximum number of observations that will be allowed under a sequential
procedure, but in this case the appropriate probability function may not be
known. The closed sequential procedure described below is intended to correct
this problem when an AUC test is being used.

2. ASSUMPTIONS AND GOALS 

This section gives a more precise description of the assumptions 
being made and the goals of the test. 

Consider a domain of skills, and suppose every skill is represented
by a multiple-choice test item having $t$ alternatives from which to choose,
one of which is correct. Let $\tau$ be the proportion of skills a specific
examinee has acquired, and let $p_i$ ($i = 1, \ldots, t$) be the probability
that the examinee chooses the correct response on the $i$th attempt of a
randomly chosen item. Wilcox (1981a) assumes that if the examinee has
acquired the skill corresponding to a randomly sampled item, he/she gives the
correct response on the first attempt. If the examinee does not know, it is
assumed that at most $t-2$ distractors can be eliminated, and that the
examinee guesses at random from among those that remain. This is, of course,
an oversimplification of reality since the model does not allow for
misinformation, nor the possibility of knowing and inadvertently choosing an
incorrect response. Other latent structure models have been proposed
that include errors at the item level such as misinformation, but these
models make certain assumptions that may not hold in many situations.
(See Molenaar, 1981; Wilcox, 1981a; in press b.)

Based on the above assumptions, it has been shown that $\tau = p_1 - p_2$
(Wilcox, 1981a). This suggests that for an AUC scoring procedure, if there
are $x_1$ items for which the examinee is correct on the first attempt, and
if there are $x_2$ items for which the examinee is correct on the second
attempt, $\tau$ might be estimated with

$$\hat{\tau} = (x_1 - x_2)/n \tag{2.1}$$

where $n$ is the number of items on the test. The appeal of equation 2.1
is that it corrects for guessing without assuming guessing is at random.
Wilcox's model implies that

$$p_1 \ge p_2 \ge \cdots \ge p_t \tag{2.2}$$




and empirical investigations suggest that this inequality will frequently be
reasonable (Wilcox, in press a). Note that if (2.2) is assumed, a maximum
likelihood estimate of $\tau$, assuming $x_1$ and $x_2$ have a multinomial
distribution, can be obtained via the pool-adjacent-violators algorithm
(Barlow et al., 1972), which is

$$\hat{\tau} = \begin{cases} (x_1 - x_2)/n, & x_1 \ge x_2 \\ 0, & \text{otherwise.} \end{cases}$$
The two most common goals of a criterion-referenced test are estimating
true score, and determining whether $\tau$ is above or below a known
constant, say $\tau_0$ (Hambleton et al., 1978). The remainder of the paper
considers these problems when a sequential or closed sequential procedure
is used to estimate $\tau$.

3. A SEQUENTIAL OR INVERSE SAMPLING PROCEDURE
This section summarizes some existing results on estimating $p_i$ under
a conventional inverse sampling procedure. The main reason for including
this section is to motivate the closed sequential procedure described in
section 4.

Here it is assumed that an item is randomly sampled and the examinee
responds to it according to the AUC scoring procedure previously described.
Once the examinee identifies the correct response, another item is randomly
sampled and administered, and the process continues until there are $N$ items
for which the first alternative chosen by the examinee is the correct
response. Once sampling is terminated, let $y_2$ be the number of items for
which the examinee chooses the correct response on the second attempt of
an item, and let $y_3$ be the number of items for which more than two attempts
were needed. The probability function of $y_2$ and $y_3$ is negative
multinomial, which is given by

$$f(y_2, y_3 \mid p_1, p_2) = \frac{(N-1+y_2+y_3)!}{y_2!\,y_3!\,(N-1)!}\; p_1^{N} p_2^{y_2} q^{y_3} \tag{3.1}$$

where $p_1$ and $p_2$ are defined in section 2, and $q = 1 - p_1 - p_2$ (e.g.,
Sibuya, 1964). Properties of this distribution are summarized by Sibuya
(1964), Mosimann (1963), and Johnson & Kotz (1969). See, also, Olkin & Sobel
(1965), Olkin (1972), and Cacoullos & Sobel (1966).

The maximum likelihood estimates of $p_1$ and $p_2$ are
$\hat{p}_1 = N/(N+y_2+y_3)$ and $\hat{p}_2 = y_2/(N+y_2+y_3)$, respectively.
As previously mentioned, $\tau = p_1 - p_2$, so the maximum likelihood
estimate of the examinee's true score is $\hat{\tau} = (N-y_2)/(N+y_2+y_3)$
(Zehna, 1966).

Consider the problem of determining whether $\tau$ is above or below
$\tau_0$. The obvious solution is to decide $\tau \ge \tau_0$ if and only if
$\hat{\tau} \ge \tau_0$. This is the typical type of decision rule used with
criterion-referenced tests, and it is the solution used here. Thus, for
$\tau \ge \tau_0$, the probability of a correct decision (PCD) is

$$R = \sum_{A} f(y_2, y_3 \mid p_1, p_2) \tag{3.2}$$

where $A = \{(y_2, y_3): \hat{\tau} \ge \tau_0\}$. For $\tau < \tau_0$ the PCD
is just $1 - R$.

Given $p_1$ and $p_2$, the PCD can be compared to the usual fixed sample
size solution, and some comparisons are made in the next section. The
expected number of observations is also easily computed, and it is given by
$N + N(p_2+q)/p_1$.
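
Because the set A in expression 3.2 is finite whenever $\tau_0 > 0$, the sum can be evaluated by direct enumeration. The sketch below does so using the negative multinomial probability function of equation 3.1; the function names and the small tolerance at the decision boundary are assumptions of the illustration.

```python
from math import factorial


def neg_multinomial_pmf(y2, y3, N, p1, p2):
    """Probability of (y2, y3) when sampling stops at N first-attempt successes."""
    q = max(1.0 - p1 - p2, 0.0)  # guard against tiny negative round-off
    coef = factorial(N - 1 + y2 + y3) // (
        factorial(y2) * factorial(y3) * factorial(N - 1))
    return coef * p1**N * p2**y2 * q**y3


def prob_decide_above(N, p1, p2, tau0):
    """R: probability that (N - y2) / (N + y2 + y3) >= tau0, for tau0 > 0."""
    eps = 1e-12  # tolerance for ties at the boundary
    total = 0.0
    y2 = 0
    while N - y2 >= tau0 * (N + y2) - eps:  # no admissible y3 beyond this y2
        y3 = 0
        while N - y2 >= tau0 * (N + y2 + y3) - eps:
            total += neg_multinomial_pmf(y2, y3, N, p1, p2)
            y3 += 1
        y2 += 1
    return total


print(round(prob_decide_above(10, 0.85, 0.10, 0.7), 4))
```

For a true score below the criterion, the PCD would be one minus the returned value, as stated in the text.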






An appeal of sequential procedures is that the expected number of
observations can be substantially less than what is needed under a fixed
sample size approach. However, as previously indicated, there is a
positive probability that the actual number of observations will be large.
In practice this problem is avoided by determining in advance the maximum
total number of observations that will be allowed. However, if sampling
is terminated when $N + y_2 + y_3$ reaches a predetermined value, the joint
probability function of $y_2$ and $y_3$ is no longer given by the negative
multinomial distribution. The next section proposes a possible solution to
this problem when determining whether $\tau$ is above or below $\tau_0$.

4. A CLOSED SEQUENTIAL PROCEDURE
Suppose the sequential procedure in section 3 is used, in which case
$\tau < \tau_0$ is decided if

$$\frac{N - y_2}{N + y_2 + y_3} < \tau_0.$$

Rearranging terms, the decision $\tau < \tau_0$ is made if

$$N(1-\tau_0) < (1+\tau_0)\,y_2 + \tau_0\,y_3 \tag{4.1}$$

Thus, once $(1+\tau_0)\,y_2 \ge N(1-\tau_0)$, or $\tau_0\,y_3 \ge N(1-\tau_0)$,
there is no point in sampling more items because the decision
$\tau < \tau_0$ will be made no matter how well the examinee performs on the
remaining items.

Next suppose the inverse sampling scheme is modified so that sampling
terminates when $y_1 = N$, or $y_2 = M$, or $y_3 = m$, where $y_1$ is the
number of items for which the examinee is correct on the first alternative
chosen. For the moment $M$ and $m$ represent arbitrary integers.



The joint probability function of $y_1$, $y_2$ and $y_3$ can be derived in the
same way as was the distribution in Wilcox (in press, b), and so the details
are not given. An alternative derivation is also available by viewing the
process as a random walk on a three dimensional lattice, but again the
details are relatively straightforward, and so they are omitted. The
result is that the joint probability function is given by

$$\frac{(N-1+y_2+y_3)!}{(N-1)!\,y_2!\,y_3!}\; p_1^{N} p_2^{y_2} q^{y_3} \qquad (y_1 = N,\; 0 \le y_2 < M,\; 0 \le y_3 < m) \tag{4.2}$$

$$\frac{(y_1+M-1+y_3)!}{y_1!\,(M-1)!\,y_3!}\; p_1^{y_1} p_2^{M} q^{y_3} \qquad (0 \le y_1 < N,\; y_2 = M,\; 0 \le y_3 < m) \tag{4.3}$$

$$\frac{(y_1+y_2+m-1)!}{y_1!\,y_2!\,(m-1)!}\; p_1^{y_1} p_2^{y_2} q^{m} \qquad (0 \le y_1 < N,\; 0 \le y_2 < M,\; y_3 = m) \tag{4.4}$$



The discussion of the decision rule under the sequential procedure
suggests that the closed sequential solution be used with
$M = N(1-\tau_0)/(1+\tau_0)$ and $m = N(1-\tau_0)/\tau_0$. If sampling
terminates because $y_2 = M$ or $y_3 = m$ occurs, the decision
$\tau < \tau_0$ is made. If sampling stops because $y_1 = N$, decide
$\tau \ge \tau_0$ if and only if $(N-y_2)/(N+y_2+y_3) \ge \tau_0$. This is the
same decision rule used under the sequential procedure described in section 3,
but this rule can be justified based solely on the probability function in
equations 4.2, 4.3 and 4.4. To see this, note that the maximum likelihood
estimate of $p_i$ ($i = 1, 2, 3$, with $p_3 = q$) under the closed sequential
procedure is

$$\hat{p}_i = \frac{y_i}{y_1 + y_2 + y_3} \tag{4.5}$$




where one and only one of the $y_i$'s has attained its maximum value. By
the choice of $M$ and $m$, the decision $\tau < \tau_0$ is made if $y_2 = M$
or $y_3 = m$ because equation 4.5 yields an estimate of $\tau = p_1 - p_2$
that is less than $\tau_0$. If $y_1 = N$, the decision $\tau \ge \tau_0$ is
reached if $(N-y_2)/(N+y_2+y_3) \ge \tau_0$.

The above discussion reveals the important result that the PCD under
the closed sequential procedure is exactly the same as it is under the
sequential procedure. To see this, note that for $\tau>\tau_0$, the PCD under
the closed sequential procedure is

$\sum \frac{(N-1+y_2+y_3)!}{(N-1)!\,y_2!\,y_3!}\; p_1^{N} p_2^{y_2} q^{y_3}$,   (4.6)

where the sum is over all $(y_2,y_3)$ for which $(N-y_2)/(N+y_2+y_3)>\tau_0$, and
which is the same as expression 3.2. It follows that the PCD is also the
same under the two procedures for $\tau<\tau_0$.

A Comparison of the Fixed Sample Size and Closed Sequential Solution 

For a conventional item sampling model where the total number of items
is fixed at $n$, the random variables $x_1$ and $x_2$, which were defined in section 2,
have a multinomial distribution. Thus, when comparing $\tau$ to $\tau_0$ and when
$\tau>\tau_0$, the PCD is

$\sum_{B} \frac{n!\; p_1^{x_1} p_2^{x_2} q^{\,n-x_1-x_2}}{x_1!\,x_2!\,(n-x_1-x_2)!}$   (4.7)

where $B=\{(x_1,x_2): (x_1-x_2)/n>\tau_0\}$. For $\tau<\tau_0$ the PCD is just one minus
this quantity.
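
The sum in (4.7) is small enough to evaluate by direct enumeration. The sketch below is one way to do so; the function name is hypothetical, and $q=1-p_1-p_2$ is an assumption carried over from the multinomial setup.

```python
from math import factorial


def pcd_fixed_sample(n, tau0, p1, p2):
    """PCD for tau > tau0 under a fixed test length n, as in (4.7).

    Sums the multinomial probabilities over the set
    B = {(x1, x2): (x1 - x2)/n > tau0}. Assumes q = 1 - p1 - p2.
    """
    q = 1.0 - p1 - p2
    total = 0.0
    for x1 in range(n + 1):
        for x2 in range(n - x1 + 1):
            if x1 - x2 > tau0 * n:              # membership in B
                x3 = n - x1 - x2
                c = factorial(n) // (
                    factorial(x1) * factorial(x2) * factorial(x3)
                )
                total += c * p1**x1 * p2**x2 * q**x3
    return total
```

For $\tau<\tau_0$ the corresponding value is `1 - pcd_fixed_sample(n, tau0, p1, p2)`, following the last sentence above.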

To compare the fixed and closed sequential procedures, the PCD was
calculated for $n=14$, $N=10$, $\tau_0=.7$, $p_1=.85$ and $.075<p_2<.15$. This interval
for $p_2$ was used because it is consistent with the assumption in equation
2.2 when $p_1=.85$. The results are shown in Figure 1, where the curves PS and
P are the PCD under the closed sequential and fixed sample size procedures,
respectively. As can be seen, the closed sequential procedure is
consistently better. As an additional comparison, the PCD was computed for
$p_1=.7$ and $.15<p_2<.30$. The results are plotted in Figure 2, and again the
closed sequential procedure is consistently better.


CONCLUDING REMARKS 
It has not been shown that the closed sequential procedure will always
improve upon the fixed sample size approach to criterion-referenced tests
when Wilcox's answer-until-correct scoring procedure is used. However,
all indications are that given $n$, we can choose $N$, $M$ and $m$ so that the
number of observations under the closed sequential procedure will be at
most $n$, and yet it will give superior results. Moreover, the expected
number of observations will be less. Thus, in situations where
computerized testing is feasible, it would seem that the closed sequential
procedure should be given serious consideration.






REFERENCES

Barlow, R., Bartholomew, D., Bremner, J., & Brunk, H. Statistical inference
under order restrictions. New York: Wiley, 1972.

Cacoullos, T., & Sobel, M. An inverse sampling procedure for selecting
the most probable event in a multinomial distribution. Multivariate
Analysis: Proceedings of an International Symposium. (P. R. Krishnaiah,
ed.) Academic Press, New York, 1966.

Gilman, D. A., & Ferry, P. Increasing test reliability through self-
scoring procedures. Journal of Educational Measurement, 1972, 9,
205-207.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. Criterion-
referenced testing and measurement: A review of technical issues and
developments. Review of Educational Research, 1978, 48, 1-47.

Hanna, G. S. Incremental reliability and validity of multiple-choice
tests with an answer-until-correct procedure. Journal of Educational
Measurement, 1975, 12, 175-178.

Johnson, N., & Kotz, S. Discrete distributions. New York: Wiley, 1969.

Molenaar, I. On Wilcox's latent structure model for guessing. British
Journal of Mathematical and Statistical Psychology, 1981, 34, in press.

Mosimann, J. E. On the compound negative multinomial distribution and
correlations among inversely sampled pollen counts. Biometrika,
1963, 50, 47-54.

Olkin, I., & Sobel, M. Integral expressions for tail probabilities of
the multinomial and negative multinomial distributions. Biometrika,
1965, 52, 167-179.

Pressey, S. L. Development and appraisal of devices providing immediate
automatic scoring of objective tests and concomitant self-instruction.
The Journal of Psychology, 1950, 29, 419-447.

Sibuya, M., Yoshimura, I., & Shimizu, R. Negative multinomial distribution.
Annals of the Institute of Statistical Mathematics, 1964, 16, 409-426.

van den Brink, W. P., & Koele, P. Item sampling, guessing and decision-
making in achievement testing. British Journal of Mathematical and
Statistical Psychology, 1980, 33, 104-108.

Wetherill, G. B. Sequential methods in statistics. London: Halsted
Press, 1975.

Wilcox, R. R. Determining the length of a criterion-referenced test.
Applied Psychological Measurement, 1980, 4, 425-446.

Wilcox, R. R. Solving measurement problems with an answer-until-correct
scoring procedure. Applied Psychological Measurement, 1981, 5,
in press. (a)

Wilcox, R. R. Recent advances in measuring achievement: A response to
Molenaar. British Journal of Mathematical and Statistical Psychology,
1981, in press. (b)

Wilcox, R. R. Some empirical and theoretical results on an answer-until-
correct scoring procedure. British Journal of Mathematical and
Statistical Psychology, in press. (a)

Wilcox, R. R. A closed sequential procedure for comparing the binomial
distribution to a standard. British Journal of Mathematical and
Statistical Psychology, in press. (b)

Wilcox, R. R. Determining the length of multiple-choice criterion-
referenced tests when an answer-until-correct scoring procedure
is used. Educational and Psychological Measurement, in press.
(c)

Zehna, P. W. Invariance of maximum likelihood estimation. Annals of
Mathematical Statistics, 1966, 37, 744.

[Figure 2: plot not recoverable from the scan.]



Approximating the Probability of Identifying the
Most Effective Treatment for the Case of Normal Distributions
Having Unknown and Unequal Variances

Rand R. Wilcox

Department of Psychology
University of Southern California
and
Center for the Study of Evaluation
University of California, Los Angeles



ABSTRACT 



When comparing k normal populations, an investigator might want
to know the probability that the population with the largest population
mean will have the largest sample mean. Put another way, what is the
probability of correctly identifying the most effective treatment?
The paper describes and illustrates methods of approximating this
probability when the variances are unknown and possibly unequal. The
results described here can also be used to measure the extent to which
the populations differ from one another.



Consider k normal distributions with means $\mu_i$ and variances
$\sigma_i^2$ ($i=1,\ldots,k$). In psychology and education it is common practice to
test the hypothesis that $\mu_1=\cdots=\mu_k$. If the null hypothesis is rejected,
there are many instances when an investigator wants to determine which
of the distributions has the largest mean. If, for example, three
methods of treating depression are being compared, or perhaps three
methods of teaching statistics, an investigator might start by testing
whether the population means are equal. If the null hypothesis is
rejected, interest shifts to determining the most effective method.
The obvious choice is the treatment with the largest sample mean. Once
a treatment has been selected as the one most effective, it is only
natural to want to determine the probability that the most effective
treatment was indeed selected, i.e., we want to determine the probability
that the distribution with the largest population mean will have the
largest sample mean. Note that if this probability were known exactly,
we would have a measure of the extent to which the treatments differ
from one another (cf. Hays, 1973, pp. 481-491; Cleveland & Lachenbruch,
1974).

Typically, the approach to the problem just described is from the
point of view of designing an experiment (e.g., Gibbons, Olkin, &
Sobel, 1977). In particular, procedures have been devised for
determining how many observations are needed so that an investigator can
be reasonably certain that the most effective treatment is identified.
The normal case has been considered by Bechhofer (1954), Bechhofer,
Dunnett and Sobel (1954) and Dudewicz and Dalal (1975). These solutions
are similar to determining power, but there are important differences.
Also, these solutions are highly conservative in the sense that if
$\mu_{[k]}-\mu_{[k-1]} \ge \delta^*$, the probability of a correct selection is at least
$P^*$, where $\mu_{[k]} \ge \cdots \ge \mu_{[1]}$ are the population means written in descending
order and where $\delta^*$ and $P^*$ are predetermined constants. The value of
$\delta^*$ represents the smallest difference between $\mu_{[k]}$ and $\mu_{[k-1]}$ the
experimenter believes worth detecting. In actuality the difference
$\mu_{[k]}-\mu_{[k-1]}$ might be considerably larger than $\delta^*$, in which case fewer
observations are really needed to guarantee that the best treatment
is selected for use.

Recently, Tong (1978) proposed an adaptive sequential approach to
the problem of identifying the most effective treatment for the case of
normal distributions having a common known variance. The motivation
for the procedure is to take advantage of situations where $\mu_{[k]}-\mu_{[i]}$
($i=1,\ldots,k-1$) is large. The basic idea is that if the population means
are substantially different, fewer observations are needed than when
the differences are small, say equal to $\delta^*$. A crucial step in applying
this solution is estimating the probability that the distribution with
the largest population mean will produce the largest sample mean. A
method of estimating this value is available, but it requires numerical
quadrature, which can be rather expensive to use. Accordingly, Tong
uses bounds on this probability (Olkin, Sobel, & Tong, 1976) that
are easily computed. The purpose of this paper is to describe and
illustrate methods of estimating similar bounds when the variances are
unknown and unequal.



Description of the Procedure

Let $x_{ij}$ ($i=1,\ldots,k$; $j=1,\ldots,n+1$) be $n+1$ randomly sampled
observations from the $i$th normal distribution. Compute
$\bar{x}_i = \sum_{j=1}^{n} x_{ij}/n$ and $s_i^2 = \sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2/(n-1)$.
For technical reasons explained below, it is necessary to assume that
$n+1>s_i^2$. This is not a serious restriction in practice since the
possible values of $x_{ij}$ are usually bounded. If, for example, there are
known constants $a$ and $b$ such that $0<a<x_{ij}<b$, and if every $x_{ij}$ is
divided by $a+b$, $s_i^2$ will be less than one, and the results described
below can be applied.

Next compute

$\tilde{x}_i = \sum_{j=1}^{n+1} a_{ij} x_{ij}$   (1)

where

$a_{i,n+1} = 1 - nc_i$,

$a_{i1} = a_{i2} = \cdots = a_{in} = c_i$

and

$c_i = \frac{n + \sqrt{n^2 - n(n+1)(1 - s_i^{-2})}}{n(n+1)}$.

For technical reasons Dudewicz and Dalal select the treatment with the
largest $\tilde{x}_i$ value as the one that has the largest mean. In practice this
will usually be the same as selecting the treatment with the largest
sample mean.
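
The computation of $c_i$ and $\tilde{x}_i$ can be sketched as follows. The function name and the sample values in the usage note are hypothetical and are not taken from the paper; the sketch also assumes the single-extra-observation case described above, with $s_i^2$ computed from the first $n$ observations only.

```python
from math import sqrt
from statistics import variance


def weighted_statistic(first_n, extra):
    """Return (c_i, x_tilde_i) for one treatment group.

    first_n : the first n observations (used for the variance)
    extra   : the (n+1)th observation
    Requires 0 < s_i^2 < n + 1, the restriction stated in the text.
    """
    n = len(first_n)
    s2 = variance(first_n)                 # divisor n - 1
    if not 0.0 < s2 < n + 1:
        raise ValueError("need 0 < s^2 < n + 1")
    c = (n + sqrt(n * n - n * (n + 1) * (1.0 - 1.0 / s2))) / (n * (n + 1))
    # The first n observations get weight c; the last gets 1 - n*c,
    # so the weights sum to one.
    x_tilde = c * sum(first_n) + (1.0 - n * c) * extra
    return c, x_tilde
```

For instance, `weighted_statistic([0.2, 0.5, 0.4, 0.9, 0.6, 0.3, 0.7], 0.5)` applies the procedure to seven made-up observations already rescaled to lie between zero and one. The group with the largest returned `x_tilde` would be the one selected.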

If $\tilde{x}_{(i)}$ is the value of $\tilde{x}$ for the population having mean $\mu_{[i]}$,
then the probability of a correct selection (PCS) is the probability
that the distribution with mean $\mu_{[k]}$ will have the largest $\tilde{x}_i$ value. This
probability is given by

$\Pr(\tilde{x}_{(i)} \le \tilde{x}_{(k)},\ i=1,\ldots,k-1)$

$= \Pr(\tilde{x}_{(i)} - \mu_{[i]} \le \tilde{x}_{(k)} - \mu_{[k]} + \delta_i;\ i=1,\ldots,k-1)$   (2)

where $\delta_i = \mu_{[k]}-\mu_{[i]}$ ($i=1,\ldots,k-1$). From Dudewicz and Dalal (1975,
p. 38), $\tilde{x}_i - \mu_i$ has a $t$ distribution with $\nu=n-1$ degrees of freedom.
Thus, (2) is equal to

$\int_{-\infty}^{\infty} \prod_{i=1}^{k-1} F_\nu(z+\delta_i)\, f_\nu(z)\,dz$   (3)

where $F_\nu$ and $f_\nu$ are the cumulative distribution and density function,
respectively, of a $t$ distribution with $\nu$ degrees of freedom. (In
Dudewicz and Dalal's notation we are setting $h=\delta^*=1$.)

From a theoretical point of view expression (3) follows from
Theorem 4.1 in Dudewicz and Dalal, which assumes that a two-stage
sampling procedure is being used. In the first stage $n$ observations
are taken, and the second stage consists of taking $n_i-n$ additional
observations sampled from the $i$th normal population where

$n_i = \max[n+1,\ s_i^2]$.

A slightly more general expression for $c_i$ is also required, namely,

$c_i = \frac{n_i - 1 + \sqrt{(n_i-1)^2 - (n_i-1)\,n_i\,(1 - s_i^{-2})}}{(n_i-1)\,n_i}$.

In many situations a two-stage sampling procedure may be expensive or
impractical, and so we have outlined how this problem might be avoided.
However, when sampling is from a truly normal distribution, a two-
stage procedure must be used in conjunction with the more general
expression for $c_i$.

To estimate the probability of identifying the most effective
treatment, i.e., the probability that the population with mean $\mu_{[k]}$
produced the largest $\tilde{x}$ value, simply replace $\delta_i$ in equation (3) with
$\hat{\delta}_i = \bar{x}_{[k]}-\bar{x}_{[i]}$, where $\bar{x}_{[i]}$ is the sample mean corresponding to the
population that produced the $i$th largest $\tilde{x}$ value.

Bounds on the Probability of a Correct Selection 

So far, nothing particularly new or unusual has been described;
we have merely followed the developments in Olkin, Sobel, and Tong
(1976). The only difference is that the procedure in Dudewicz and Dalal
(1975) was used to handle the unknown and possibly unequal variances;
Olkin et al. assume the variances are known. The main concern in this
section is evaluating (3). This can be done with numerical quadrature
techniques (e.g., Dudewicz, Ramberg, & Chen, 1975), but this can be
expensive, particularly when the degrees of freedom are small. Accordingly,
we derive upper and lower bounds on (3).

Our main result can be described as follows: Let

$p_i = \int_{-\infty}^{\infty} F_\nu(z+\delta_i)\, f_\nu(z)\,dz$

$q_i = 1 - p_i$

$q_{ij} = 1 + p_i p_j - p_i - p_j$

$Q_1 = \sum_{i=1}^{k-1} q_i$

$Q_2 = \max_j \sum_{i \ne j} q_{ij}$, where the summation is from 1 to $k-1$.

Values of the integral in the definition of $p_i$ are given in a table in
Dudewicz and Dalal (1975, p. 52). Recalling that the PCS is given
by (3), it will be shown that

PCS $\ge 1 - Q_1 + Q_2$.   (4)
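
Once the $p_i$ values are in hand (from tables, for example), the lower bound in (4) is simple arithmetic. The sketch below computes it together with the product of the $p_i$, which is the Kimball-type lower bound labelled (6) in this paper. The function name is hypothetical, and the $p_i$ values in the usage note are made up for illustration.

```python
from math import prod


def pcs_lower_bounds(p):
    """Given p_1, ..., p_{k-1}, return (bound_4, bound_6).

    bound_4 = 1 - Q1 + Q2, the lower bound stated in (4)
    bound_6 = product of the p_i, the product-form lower bound (6)
    """
    m = len(p)
    q = [1.0 - pi for pi in p]
    Q1 = sum(q)
    # q_ij = 1 + p_i p_j - p_i - p_j, which equals q_i * q_j
    Q2 = max(
        sum(1.0 + p[i] * p[j] - p[i] - p[j] for i in range(m) if i != j)
        for j in range(m)
    )
    return 1.0 - Q1 + Q2, prod(p)
```

With only two comparisons, say `pcs_lower_bounds([0.95, 0.97])`, the two bounds coincide because $1-q_1-q_2+q_1q_2=(1-q_1)(1-q_2)$. With three or more they generally differ.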






To establish (4), the following definition is required. Let
$A=(a_1,\ldots,a_k)$ and $B=(b_1,\ldots,b_k)$ be any two vectors, and let
$a_{[1]} \le \cdots \le a_{[k]}$ and $b_{[1]} \le \cdots \le b_{[k]}$ be the components of $A$ and $B$
written in ascending order. A function $\phi$ is Schur-concave if

$\sum_{i=1}^{m} a_{[i]} \ge \sum_{i=1}^{m} b_{[i]}$   ($m=1,\ldots,k-1$)

and

$\sum_{i=1}^{k} a_{[i]} = \sum_{i=1}^{k} b_{[i]}$

implies that

$\phi(A) \ge \phi(B)$

(e.g., Marshall & Olkin, 1979).

From Theorem 6.2.5 and Corollary 1 in Tong (1980, pp. 110-111) we
have that $\prod F_\nu(z+\delta_i)$ is a Schur-concave function of the $\delta_i$'s, which
implies that (3) is Schur-concave as well. Thus, an upper bound to
(3) is

$\int_{-\infty}^{\infty} F_\nu^{\,k-1}(z+\bar{\delta})\, f_\nu(z)\,dz$   (5)

where $\bar{\delta} = \sum_{i=1}^{k-1} \delta_i/(k-1)$. The integral in (5) can be evaluated via the
tables in Dudewicz and Dalal (1975).

From Kimball (1951) a lower bound to (3) is

$\prod_{i=1}^{k-1} \int_{-\infty}^{\infty} F_\nu(z+\delta_i)\, f_\nu(z)\,dz$.   (6)

But Theorem 7.1.4 in Tong (1980, p. 147) implies that

PCS $\ge 1 - Q_1 + \max_j \sum_{i \ne j} \int_{-\infty}^{\infty} [1-F_\nu(z+\delta_i)][1-F_\nu(z+\delta_j)]\, f_\nu(z)\,dz$.

Applying (6) to the summation in this last inequality establishes (4).
For certain refinements of (6), see Olkin, Sobel, and Tong (1976).



Some Illustrations 



To illustrate how the bounds on the PCS compare to the actual
value, Monte Carlo techniques were used to evaluate (3) using
arbitrarily chosen $\delta_i$ values. Column 1 in Table 1 shows the resulting
approximations to (3) based on 2,000 iterations. Our computer program
was checked by approximating some of the values in the tables reported
by Dudewicz and Dalal (1975).
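
A Monte Carlo approximation of this kind can be sketched with the standard library alone. The following is not the authors' program: it is a minimal illustration that draws independent $t$ variates with $\nu$ degrees of freedom and counts how often the event in (2) occurs. The function names, the seed, and the way the $t$ variates are generated are all assumptions of this sketch.

```python
import random
from math import sqrt


def t_variate(rng, nu):
    """One draw from a t distribution with nu degrees of freedom,
    built as a standard normal over sqrt(chi-square / nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)   # chi-square with nu df
    return z / sqrt(chi2 / nu)


def pcs_monte_carlo(deltas, nu, iterations=2000, seed=0):
    """Approximate the integral (3) for the given delta_i values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        t_best = t_variate(rng, nu)          # plays the role of z in (3)
        if all(t_variate(rng, nu) <= t_best + d for d in deltas):
            hits += 1
    return hits / iterations
```

A call such as `pcs_monte_carlo([0.5, 1.0, 1.0], nu=9)` would approximate (3) for three competing populations. With 2,000 iterations the standard error is roughly .01 near a PCS of .5, so only two decimal places are meaningful.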

Table 1 suggests that when the value of (3) is relatively small,
the upper bound given by (5) will be fairly close to the value of (3).
More importantly, when (3) has a value close to one, the bounds given
by (4), (5), and (6) yield a reasonably short interval which contains
(3). The implication is that if, for example, we want to know whether
the estimated PCS is at least .95, (4), (5), and (6) may give a fairly
good indication of whether this is true.

As a final illustration, we reanalyze some data in Winer (1971,
p. 153). The goal was to compare three methods of teaching, and there
were 8 observations for each group. The observed scores are shown in
Table 2.

Using the first seven observations in each group, we find that
$c_1=.2855$, $c_2=.2959$, and $c_3=.2852$. Thus, $\tilde{x}_1=-.7060$, $\tilde{x}_2=4.112$, and
$\tilde{x}_3=6.148$, and so according to the procedure in Dudewicz and Dalal,
method 3 would be chosen as the most effective. (It is readily verified
that $s_i^2<7$ for $i=1,2$ as was required.) The question arises as to how
certain we can be that method 3 is indeed the best. Since $\bar{x}_1=4.75$,
$\bar{x}_2=4.625$, and $\bar{x}_3=7.75$, we have that $\hat{\delta}_1=3.0$ and $\hat{\delta}_2=3.125$. From a table
in Dudewicz and Dalal (p. 53), the value of (5) is approximately .93. The
lower bounds given by (4) and (6) are both .925. Thus, in this particular
instance, we have a very good approximation to the estimated PCS.
If an investigator wants the PCS to be even higher, the data indicate
that additional observations must be taken.

Concluding Remarks

It is possible to sequentially estimate the PCS by applying the
procedure described here in the manner proposed by Tong (1978). Many
of Tong's theoretical results extend immediately to the present situation,
and so further comments are omitted.

Another point is that there are alternative choices for the $c_i$
values (e.g., Dudewicz, Ramberg, & Chen, 1975), but at present there
seems to be no compelling reason for choosing one procedure over
another. For a third possible procedure, see Bishop and Dudewicz (1978).

Henery (1981) proposed a method of estimating the PCS when the
distributions are normal with a common known variance. We checked
the accuracy of this procedure by approximating various values in the
tables reported by Bechhofer (1954); similar checks were not made by
Henery. We got reasonably good results for k=2,3 and when the PCS was
less than or equal to .82, but otherwise the approximation was very
poor. Despite this negative finding, a modification of Henery's
procedure was tried on the case of unknown and unequal variances, but there
is no indication that it would ever have any practical value. At the
moment, the best approach seems to be to use the bounds on the PCS
given by (4), (5), and (6).

Finally, as alluded to earlier, the results given here can be used
to measure the extent to which k normal populations differ from one
another. If $\mu_1=\mu_2=\cdots=\mu_k$, the PCS is equal to $k^{-1}$, its minimum
possible value. As the $\delta_i$ values increase, so does the PCS (cf.
Hedges, 1981).






References 

Bechhofer, R.E. A single-sample multiple decision procedure for ranking
means of normal populations with known variances. Annals of
Mathematical Statistics, 1954, 25, 16-39.

Bechhofer, R.E., Dunnett, C.W., & Sobel, M. A two-sample multiple
decision procedure for ranking means of normal populations with
a common unknown variance. Biometrika, 1954, 41, 170-176.

Bishop, T., & Dudewicz, E. Exact analysis of variance with unequal
variances: Test procedures and tables. Technometrics, 1978, 20,
419-430.

Cleveland, W.J., & Lachenbruch, P.A. A measure of divergence among
several populations. Communications in Statistics, 1974, 3, 201-211.

Dudewicz, E.J., Ramberg, J.S., & Chen, H.J. New tables for multiple
comparisons with a control (unknown variances). Biometrische
Zeitschrift, 1975, 17, 13-26.

Dudewicz, E.J., & Dalal, S.R. Allocation of observations in ranking
and selection with unequal variances. Sankhya, 1975, Series B, 37,
28-78.

Gibbons, J.D., Olkin, I., & Sobel, M. Selecting and ordering
populations: A new statistical methodology. New York: Wiley, 1977.

Hays, W. Statistics for the social sciences. New York: Holt, Rinehart
and Winston, 1973.

Hedges, L.V. Distribution theory for Glass's estimator of effect size
and related estimators. Journal of Educational Statistics, 1981,
6, 107-128.

Henery, R.J. Permutation probabilities as models for horse races.
Journal of the Royal Statistical Society, 1981, Series B, 43, 86-91.

Kimball, A.W. On dependent tests of significance in the analysis of
variance. Annals of Mathematical Statistics, 1951, 22, 600-602.

Marshall, A.W., & Olkin, I. Inequalities: Theory of majorization and
its applications. New York: Academic Press, 1979.

Olkin, I., Sobel, M., & Tong, Y.L. Estimating the true probability
of correct selection for location and scale parameter families
(Technical Report No. 110). Stanford University, Department of
Statistics, 1976.

Tong, Y.L. An adaptive solution to ranking and selection problems.
The Annals of Statistics, 1978, 6, 658-672.

Tong, Y.L. Probability inequalities in multivariate distributions.
New York: Academic Press, 1980.



TABLE 1

Illustrative Bounds on (3) for n = 10

Approximate
Value of (3)     $\delta_i$ values
    .470         .5    1.0   1.0
    .508         .5    1.0   1.5
    .981         3.6   4.2   5.1
    .968         3.3   4.1   4.2
    .592         1.1   1.4   1.6   1.7
    .815         2.0   2.1   2.7   2.9
    .991         4.3   4.7   5.1   5.9
    .827         1.7   2.8   3.4   3.5   3.9

[The columns headed "Value of (4)", "Value of (5)" and "Value of (6)" are
only partly legible in the scan, and their row alignment cannot be
recovered. Entries as read, in scan order: .306, .470, .348, .391, .364,
.528, .977, .976, .964, .965, .407, .600, .459, .747, .830, .755, .988,
.992, .388, .790, .895, .792.]



A CAUTIONARY NOTE ON ESTIMATING THE
RELIABILITY OF A MASTERY TEST WITH
THE BETA-BINOMIAL MODEL

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles

and the

DEPARTMENT OF PSYCHOLOGY
University of Southern California

The project presented or reported herein was performed pursuant to a grant
from the National Institute of Education, Department of Health, Education,
and Welfare. However, the opinions expressed herein do not necessarily
reflect the position or policy of the National Institute of Education, and
no official endorsement by the National Institute of Education should be
inferred.



ABSTRACT

Based on recently published papers, one might be tempted to routinely
apply the beta-binomial model to obtain a single administration estimate
of the reliability of a mastery test. Using real data, the paper
illustrates two practical problems with estimating reliability in this manner.
The first is that the model might give a poor fit to data, which can
seriously affect the reliability estimate, and the second is that inadmissible
estimates of the parameters in the beta-binomial model might be obtained.
Two possible solutions are described and illustrated.

1. INTRODUCTION

In recent years, efforts have been directed toward deriving ways of
studying and characterizing mastery and criterion-referenced tests. A
summary of the statistical and psychometric techniques that have evolved
can be found in the 1980 special issue of Applied Psychological Measurement
(see also Hambleton et al., 1978). One approach that has received
considerable attention can be described as follows: Suppose two randomly
parallel test forms both consist of n dichotomously scored items. For
a randomly sampled examinee, let x and y be the observed scores on the
two test forms, and let f(x,y) be the joint probability function of x
and y for the population of examinees. If the same passing score, say
$x_0$, is used on both test forms, the proportion of agreement is defined
to be

$P = \sum_{x=x_0}^{n} \sum_{y=x_0}^{n} f(x,y) + \sum_{x=0}^{x_0-1} \sum_{y=0}^{x_0-1} f(x,y)$.   (1)

Many other methods have been proposed for characterizing mastery tests,
but at a minimum we want P to be reasonably close to one.
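
Equation (1) translates directly into code once the joint probability function is stored as an $(n+1)\times(n+1)$ array. The sketch below assumes that representation; the function name is hypothetical.

```python
def proportion_of_agreement(f, x0):
    """Evaluate equation (1).

    f  : nested list with f[x][y] = joint probability of scores x and y,
         for x, y = 0, ..., n
    x0 : passing score used on both forms
    """
    n = len(f) - 1
    both_pass = sum(f[x][y] for x in range(x0, n + 1)
                    for y in range(x0, n + 1))
    both_fail = sum(f[x][y] for x in range(x0) for y in range(x0))
    return both_pass + both_fail
```

As a toy case, if all nine cells of a two-item test are equally likely and $x_0=1$, four cells are pass-pass and one is fail-fail, so the function returns 5/9.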

Frequently it is difficult to administer two randomly parallel tests
to a random sample of examinees. Accordingly, efforts have been made to
derive an estimate of P based on the observed scores of only one test
form. A general approach to this problem is as follows: For a specific
examinee, assume the probability of an observed score x is $f(x|\theta)$, where
$\theta$ is some unknown parameter, possibly vector valued. For the randomly
parallel test, let $f(y|\theta)$ be the probability of an observed y, and
suppose $f(x|\theta)$ and $f(y|\theta)$ are independent and they have the same parametric
form. If $g(\theta)$ is the density function of $\theta$ over the population of
examinees, then

$f(x,y) = \int f(x|\theta)\,f(y|\theta)\,g(\theta)\,d\theta$.   (2)

Once a specific form for $f(x|\theta)$ and $g(\theta)$ is assumed, it is frequently
possible to estimate $g(\theta)$, which yields an estimate of f(x,y). This in
turn yields an estimate of P via equation (1).

In the statistical literature, the single administration estimate of
P described above is known as an empirical Bayes approach to prediction
analysis. For general results on prediction analysis, see Aitchison and
Dunsmore (1975).

Huynh (1976) has given a detailed account of how to estimate P
for the special case where $f(x|\theta)$ (and $f(y|\theta)$) are assumed to be binomial,
and where $g(\theta)$ is assumed to belong to the beta family of distributions.
Note, however, that Huynh concentrates on estimating Cohen's kappa (Cohen,
1960), rather than P, once the estimate of f(x,y) is available (cf. Divgi,
1980). Since Huynh's paper, several investigations of the beta-binomial
model have been reported that are relevant to estimating reliability via
equation (2). For example, Subkoviak (1978) compared it to three other
estimates of P and concluded that all four methods gave good results,
but that the beta-binomial model seemed to be the best for general use.
Additional empirical support for the beta-binomial model can be found in
Woss and SiVulman (1980). For further results and comments on P, see
Algina and Noe (1978), Huynh (1979), Divgi (1980), Traub and Rowley (1980),
and Subkoviak (1980). For a recent review of the beta-binomial model,
see Wilcox (1981).

Based on the studies cited above, one might be tempted to routinely
apply the beta-binomial model when estimating the proportion of agreement
or some related coefficient such as Cohen's kappa. In practice, though,
there are at least two practical problems that might arise. First, the
beta-binomial model might give a poor fit to the data (Keats, 1964a)
which, as illustrated below, might affect the estimate of P. Second,
the estimates of the parameters in the beta-binomial model might be
inadmissible. That is, they might be negative even though the model assumes
they are positive. Negative estimates can occur even when the model
holds, or they might occur because the model is completely inappropriate.
In some instances it might be possible to correct this problem by replacing
the estimates used by Huynh (1976) with the approximation to maximum
likelihood estimates described by Griffiths (1973). However, Griffiths'
iterative estimation procedure might not correct the problem since it can
converge to inadmissible estimates even when the model holds (Wilcox, 1979).
The purpose of this paper is to describe and illustrate a partial solution
to these two problems.

2. TWO ALTERNATIVES TO THE BETA-BINOMIAL MODEL 

Temporarily consider a single examinee responding to n dichotomously scored
items. The binomial error model assumes that

$f(x|\theta) = \binom{n}{x}\,\theta^{x}(1-\theta)^{n-x}$.   (3)

This assumption is theoretically justified when items are randomly sampled
from an infinite item pool (or a finite pool with replacement), the
examinee's responses are independent from one another, and the probability of
a correct response is $\theta$ for every randomly sampled item. In many instances
items are not randomly sampled, and even when they are, it is customary
for every examinee to respond to the same n items. Thus, it is not
surprising to find situations where (3) gives unsatisfactory results.

When trying to find a probability function that gives a good fit to
data, probably three of the best known and most frequently employed
distributions are the binomial, Poisson and negative-binomial (Johnson and
Kotz, 1969). Thus, when the beta-binomial model is unsatisfactory, it is
reasonable to consider replacing (3) with a Poisson or negative-binomial
distribution. Of course, the Poisson distribution is not new to
psychometric theory (Lord and Novick, 1968, chapter 21), and it frequently gives
good results when a particular event occurs infrequently. The negative-
binomial distribution is usually the first choice when the Poisson
distribution is believed to be inadequate (Johnson and Kotz, 1969, p. 125).

The Gamma-Poisson Model

Let $w=n-x$ and $z=n-y$ be the number of incorrect responses given by
an examinee on the first and second test forms, respectively. We begin
by replacing (3) with the assumption that the probability function of w,
as well as z, is Poisson with parameter $\eta$. Symbolically

$f(w|\eta) = e^{-\eta}\eta^{w}/w!$.   (4)

The reason for working with w and z, rather than x and y, is that the data
in our example are skewed to the right. If the observed frequencies had
been skewed to the left, we would have used x and y.

We also assume that for the population of examinees, $\eta$ has a gamma
distribution. The motivation for this assumption is that it is typically
made for the Poisson case, is mathematically convenient, and it has
given good results with mental test data (Wilcox, 1981). If $f(w|\eta)$ and
$f(z|\eta)$ are assumed to be independent, results in Aitchison and Dunsmore
(1975) tell us immediately that

$f(w) = \frac{\Gamma(\alpha+w)}{\Gamma(\alpha)\,\Gamma(w+1)} \left(\frac{\beta}{1+\beta}\right)^{w} \left(\frac{1}{1+\beta}\right)^{\alpha}$,   (5)

i.e., the marginal probability function of w is negative binomial. The
parameters $\alpha$ and $\beta$ can be estimated as follows: Let $\bar{w}$ and $s^2$ be the
sample mean and variance of w for a random sample of examinees. Then
$\hat{\beta}=(s^2/\bar{w})-1$ and $\hat{\alpha}=\bar{w}/\hat{\beta}$ estimate $\beta$ and $\alpha$ respectively. Three other
estimates of $\alpha$ and $\beta$ are also available (Johnson and Kotz, 1969).

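Again referring to Aitchison and Dunsmore (1975), we have that

$f(z|w) = \frac{\Gamma(\alpha+w+z)}{\Gamma(\alpha+w)\,\Gamma(z+1)} \left(\frac{\beta}{2\beta+1}\right)^{z} \left(\frac{\beta+1}{2\beta+1}\right)^{\alpha+w}$.   (6)

Since $f(w,z)=f(w)f(z|w)$, we have an estimate of P once $\alpha$ and $\beta$ are determined.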
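
The whole gamma-Poisson calculation, from the moment estimates through equations (5) and (6) to the proportion of agreement, can be sketched as below. The function names are hypothetical. One further assumption is made explicit in the code: because the Poisson model places no upper limit on w and z, the both-fail probability is obtained by complement rather than by summing up to n.

```python
from math import exp, lgamma, log


def gamma_poisson_estimates(w_bar, s2):
    """Moment estimates: beta = s^2/w_bar - 1, alpha = w_bar/beta."""
    beta = s2 / w_bar - 1.0
    if beta <= 0.0:
        raise ValueError("inadmissible estimate: s^2 must exceed w_bar")
    return w_bar / beta, beta


def marginal(w, alpha, beta):
    """Equation (5): negative binomial probability of w errors."""
    return exp(lgamma(alpha + w) - lgamma(alpha) - lgamma(w + 1)
               + w * log(beta / (1.0 + beta)) - alpha * log(1.0 + beta))


def conditional(z, w, alpha, beta):
    """Equation (6): probability of z errors on form 2 given w on form 1."""
    return exp(lgamma(alpha + w + z) - lgamma(alpha + w) - lgamma(z + 1)
               + z * log(beta / (2.0 * beta + 1.0))
               + (alpha + w) * log((beta + 1.0) / (2.0 * beta + 1.0)))


def agreement(n, x0, alpha, beta):
    """Estimate P when passing means at most c = n - x0 errors.

    Assumption of this sketch: w and z are treated as unbounded, so
    Pr(both fail) = 1 - 2 Pr(w <= c) + Pr(w <= c, z <= c).
    """
    c = n - x0
    both_pass = sum(marginal(w, alpha, beta) * conditional(z, w, alpha, beta)
                    for w in range(c + 1) for z in range(c + 1))
    one_pass = sum(marginal(w, alpha, beta) for w in range(c + 1))
    return both_pass + (1.0 - 2.0 * one_pass + both_pass)
```

For example, a sample mean of 3 errors with variance 6 gives `gamma_poisson_estimates(3.0, 6.0)`, and the resulting pair can be passed to `agreement(15, 10, alpha, beta)`. These input values are made up for illustration.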

The Gamma Product-Ratio/Poisson Model

The other model we consider also assumes (4), but $\eta$ is assumed to have
a "gamma product-ratio" distribution (Sibuya, 1979). In this case

$f(w) = \frac{\Gamma(w+\alpha)\,\Gamma(\beta+\gamma)\,\Gamma(w+\beta)\,\Gamma(\alpha+\gamma)}{\Gamma(w+1)\,\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(\gamma)\,\Gamma(w+\alpha+\beta+\gamma)}$   (7)

where $\alpha$, $\beta$ and $\gamma$ are unknown parameters. We note that two alternative names
for (7) are generalized Waring and negative-binomial beta. Also, the
parameters $\alpha$ and $\beta$ in (7) are different from those in (6).




To estimate a, 3 aird y» we first note that the first three factorial 
moments are' 



V 1 = aB/(Y-D 



y 2 = a(a+l)B(e+l)/[(rl)( Y -2)]- ' 

y 3 = a(o+l)(o+2)e(6+l)(e+2)/i:(Y-l)(Y-2)(Y-30 



It follows th?t 
\ 1 



S7 " y i 



Y - a - 3 = ~, 



v x +1 



and 



(8) 
(9) 
(10) 



(11) 



/- 2a - 23 = 



3yo 

— - y, + 4 



5 (12) 
an estimate of 



Thus, if is the usual estimate of ^ (1=1,2,3) s we have arye 
y, say y- Substttutfng Y ajid Jlj and y 2 into equations (8) and (9) yields 

,-1 



a = y, (y-1)3* 
f • 



(13) 



. a+3 = f {y-2)- Vl G-l)-l 



(14) 



Substituting the right-hand side of (13) for $\alpha$ in (14) yields a quadratic
equation for $\beta$. In terms of the marginal density (7), either estimate of
$\beta$ can be used since the other estimate of $\beta$ will correspond to $\alpha$, and
since (7) is symmetric in $\alpha$ and $\beta$.
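
The moment procedure can be sketched as follows. The sketch relies on two identities implied by the factorial moments, $\mu_{[2]}/\mu_{[1]} = (\alpha+1)(\beta+1)/(\gamma-2)$ and $\mu_{[3]}/\mu_{[2]} = (\alpha+2)(\beta+2)/(\gamma-3)$. Eliminating $\alpha+\beta$ between them gives $\gamma$, after which $\alpha+\beta$ and $\alpha\beta$ are known and $\alpha$, $\beta$ are the two roots of a quadratic. The function names are hypothetical, and the closed form for $\gamma$ is our own rearrangement rather than a formula quoted from the paper.

```python
import math


def waring_factorial_moments(alpha, beta, gamma):
    """First three factorial moments of the generalized Waring model (gamma > 3)."""
    m1 = alpha * beta / (gamma - 1)
    m2 = m1 * (alpha + 1) * (beta + 1) / (gamma - 2)
    m3 = m2 * (alpha + 2) * (beta + 2) / (gamma - 3)
    return m1, m2, m3


def waring_moment_estimates(m1, m2, m3):
    """Solve the three moment equations for (alpha, beta, gamma)."""
    r1 = m2 / m1
    r2 = m3 / m2
    # Eliminate alpha + beta between the two ratio identities.
    gamma = (4 * r1 - 3 * r2 - m1 - 2) / (2 * r1 - r2 - m1)
    total = r1 * (gamma - 2) - m1 * (gamma - 1) - 1   # alpha + beta
    product = m1 * (gamma - 1)                        # alpha * beta
    disc = total * total - 4 * product
    if disc < 0:
        raise ValueError("no real solution: estimates inadmissible")
    root = math.sqrt(disc)
    # The model is symmetric in alpha and beta, so the labeling is arbitrary.
    return (total + root) / 2, (total - root) / 2, gamma
```

Feeding exact moments back in, for example those of $\alpha=1$, $\beta=2$, $\gamma=5$, recovers the parameters up to the interchange of $\alpha$ and $\beta$. With sample moments the discriminant can be negative or the roots nonpositive, which is one way an inadmissible estimate shows up.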

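Finally, to estimate P with equation (1), we note that

$$f(w,z) = \frac{\Gamma(\alpha+w)\,\Gamma(\alpha+z)\,\Gamma(\beta+\gamma)\,\Gamma(\beta+w+z)\,\Gamma(2\alpha+\gamma)}{\Gamma(\alpha)\,\Gamma(\alpha)\,\Gamma(w+1)\,\Gamma(z+1)\,\Gamma(\beta)\,\Gamma(\gamma)\,\Gamma(2\alpha+\beta+\gamma+z+w)} \tag{15}$$
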



One way to establish this result is to assume $f(w \mid \theta)$ is negative binomial
and that $g(\theta)$ is beta (which is equivalent to assuming (7)) and then perform
the integration in (2).
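
As a numerical check on this construction, the sketch below codes the generalized Waring marginal and the joint function obtained by mixing two conditionally independent negative binomial scores over a beta distribution, and confirms that summing the joint function over z returns the marginal. The function names are hypothetical; the formulas are the gamma-function expressions that result from the integration just described.

```python
import math


def waring_pmf(w, alpha, beta, gamma):
    """Generalized Waring probability of w (marginal model)."""
    lg = math.lgamma
    return math.exp(lg(w + alpha) + lg(beta + gamma) + lg(w + beta)
                    + lg(alpha + gamma) - lg(w + 1) - lg(alpha) - lg(beta)
                    - lg(gamma) - lg(w + alpha + beta + gamma))


def joint_pmf(w, z, alpha, beta, gamma):
    """Joint probability of (w, z) under the negative binomial/beta mixture."""
    lg = math.lgamma
    return math.exp(lg(alpha + w) + lg(alpha + z) + lg(beta + gamma)
                    + lg(beta + w + z) + lg(2 * alpha + gamma)
                    - 2 * lg(alpha) - lg(w + 1) - lg(z + 1) - lg(beta)
                    - lg(gamma) - lg(2 * alpha + beta + gamma + w + z))
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions when w or z is large.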

3. NUMERICAL ILLUSTRATIONS

This section uses real data to illustrate the practical advantages of
estimating P with the two alternative estimates described above.

First we consider the data reported in Keats (1964). As previously
indicated, the beta-binomial model gives a poor fit to the observed test
scores, but, as noted in Wilcox (1981), the gamma-Poisson model gives a
reasonably good fit. The test had n=30 items, and Keats reports observed
test scores for 1000 examinees. If we estimate P with the beta-binomial
model, the result is .90. If we use the gamma-Poisson model, the estimate
is .81. The third estimate of P does not apply since the estimates of
the parameters in (15) are inadmissible. Note that the reliability
estimates used by Subkoviak (1976) as well as Marshall and Haertel (1975) also
assume the binomial error model holds. Since the beta-binomial model gives
a poor fit to data, there is some doubt about whether these estimates should
even be considered.

As another illustration, suppose we have an n=15 item test with a
passing score of $x_0=10$. Further suppose we have test scores as reported
in Table 1. These results are based on real data reported in Irwin (1968)
but they do not represent test scores. The point is that we might get
observed frequencies that are skewed, as are the frequencies in Table 1,
in which case it might be better, or even necessary, to replace the
beta-binomial model with something else.






For the data in Table 1, the estimates of the parameters in the
beta-binomial model are negative, and so an estimate of P cannot be made.
Suppose instead (7) holds. It follows that $\hat\alpha=5.2162$, $\hat\beta=1.297$ and $\hat\gamma=7.7967$.
Thus, the estimate of P is .97. If instead we use the gamma-Poisson model,
the estimate of P is again .97.

CONCLUDING REMARKS

The main point in the paper is that the beta-binomial model might
give a substantially different estimate of reliability relative to some
other model that gives a better fit to data. We illustrated two possible
solutions, but virtually any form for $f(w \mid \theta)$ can be used to estimate P
via equation (2) as long as an estimate of $g(\theta)$ can be obtained.




REFERENCES

Aitchison, J., & Dunsmore, I. R. Statistical prediction analysis. London:
Cambridge University Press, 1975.

Algina, J., & Noe, M. J. A study of the accuracy of Subkoviak's single
administration estimate of the coefficient of agreement using two
true-score estimates. Journal of Educational Measurement, 1978,
15, 101-110.

Cohen, J. A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 1960, 20, 37-46.

Divgi, D. R. Group dependence of some reliability indices for mastery
tests. Applied Psychological Measurement, 1980, 4, 213-218.

Griffiths, D. A. Maximum likelihood estimation for the beta-binomial
distribution and an application to the household distribution of
the total number of cases of a disease. Biometrics, 1973, 29, 637-648.

Gross, A. L., & Shulman, V. The applicability of the beta-binomial model for
criterion-referenced testing. Journal of Educational Measurement,
1980, 17, 195-202.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. Criterion-
referenced testing and measurement: A review of technical issues
and developments. Review of Educational Research, 1978, 48, 1-47.

Huynh, H. On the reliability of decisions in domain-referenced testing.
Journal of Educational Measurement, 1976, 13, 253-264.

Huynh, H. Statistical inference for two reliability indices in mastery
testing based on the beta-binomial model. Journal of Educational
Statistics, 1979, 4, 231-246.

Irwin, J. O. The generalized Waring distribution applied to accident
data. Journal of the Royal Statistical Society, 1968, 131, Series A,
205-225.

Johnson, N., & Kotz, S. Discrete distributions. New York: Wiley, 1969.

Keats, J. A. Some generalizations of a theoretical distribution of mental
test scores. Psychometrika, 1964, 29, 215-231.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Marshall, J. L., & Haertel, E. H. A single-administration reliability
index for criterion-referenced tests: The mean split-half coefficient
of agreement. Paper presented at the annual meeting of the
American Educational Research Association, 1975.

Sibuya, M. Generalized hypergeometric, digamma, and trigamma distributions.
Annals of the Institute of Statistical Mathematics, 1979, 31, 373-390.

Subkoviak, M. J. Estimating reliability from a single administration of
a criterion-referenced test. Journal of Educational Measurement,
1976, 13, 265-276.

Subkoviak, M. Decision-consistency approaches. In R. Berk (Ed.),
Criterion-referenced measurement: The state of the art. Baltimore:
The Johns Hopkins University Press, 1980.

Traub, R., & Rowley, G. L. Reliability of test scores and decisions.
Applied Psychological Measurement, 1980, in press.

Wilcox, R. R. Estimating the parameters of the beta-binomial distribution.
Educational and Psychological Measurement, 1979, 39, 527-535.

Wilcox, R. A review of the beta-binomial model and its extensions.
Journal of Educational Statistics, 1980, to appear.




ANALYZING THE DISTRACTORS OF MULTIPLE-CHOICE
TEST ITEMS OR PARTITIONING MULTINOMIAL
CELL PROBABILITIES WITH RESPECT TO A STANDARD

Rand R. Wilcox

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles 90024

and the

DEPARTMENT OF PSYCHOLOGY
University of Southern California
Los Angeles, California 90007

The work upon which this publication is based was performed
pursuant to a grant [contract] with the National Institute
of Education, Department of Health, Education and Welfare.
Points of view or opinions stated do not necessarily
represent official NIE position or policy.



ABSTRACT

When analyzing the distractors of multiple-choice test items,
it is sometimes desired to determine which of the distractors has a small
probability of being chosen by a typical examinee. At present, this
problem is handled in an informal manner. In particular, using an arbitrary
number of examinees, the probabilities associated with the distractors
are estimated and then sorted according to whether the estimated values
are above or below a known constant $p_0$. In this paper a more formal
framework for solving this problem is described. The first portion of the paper
considers the problem from the point of view of designing an experiment.
The solution is based on a procedure similar to an indifference zone
formulation of a ranking and selection problem. A later section considers
methods that might be employed in a retrospective study. Brief
consideration is also given to how an analysis might proceed when a test item has
been altered in some way.

KEY WORDS: indifference zone; empirical Bayes




Consider a multiple-choice test item having k+1 alternatives from which
to choose. One of these alternatives is designated as being correct and the
remaining k alternatives are referred to as distractors. Henrysson (1971, pp.
136-137) suggests that a statistical analysis of the distractors might be made
as follows: Administer the item to a random sample of n examinees; if the
observed frequency corresponding to a particular distractor is small, perhaps
it should be replaced or rewritten.

Henrysson's procedure certainly seems like a reasonable one and in fact
it is often used. A proposed distractor might appear to be satisfactory but
in reality it might be infrequently chosen by examinees who do not know the
correct response. It is only natural then to conduct an empirical
investigation to determine when this occurs. Insofar as we want to discover whether
an examinee knows the correct response, rewriting or replacing the distractor
might be in order when the data suggest that it is seldom chosen. The idea
is to modify the distractor in the hope of lowering the probability of guessing
the correct response. It should be stressed, however, that if any or all
distractors are infrequently chosen, this does not necessarily mean that the
distractors should be replaced. If, for example, all of the distractors are
seldom chosen, it may be that most examinees know the answer, in which case the
item might be acceptable for certain types of achievement tests while for other
situations (e.g., Lord and Novick, 1968, p. 320) the item might be discarded
altogether. The statistical techniques described here are merely meant to
alert a test constructor to the possibility of improving the distractors.

Let $p_i$ (i=1,...,k) be the probability that a randomly selected examinee
chooses the ith distractor. For convenience, the (k+1)-th alternative is
assumed to be the correct option. Thus, $p_{k+1}$ is the probability of a correct
response by a randomly chosen examinee. Consistent with Henrysson (1971),
suppose that for each distractor we want to determine whether $p_i$ is less than
or greater than some known constant $p_0$. If $p_i < p_0$, the value of $p_i$ is said to be
small and consideration is given to rewriting or replacing the distractor. If
$p_i \ge p_0$, no action is taken. A common value for $p_0$ appears to be .1 although
other values are certainly possible.

Let $x_i$ be the number of examinees who choose the ith distractor. Since
$x_i/n$ estimates $p_i$, a natural decision rule (and the one that is used) is to
decide $p_i < p_0$ if $x_i/n < p_0$; if $x_i/n \ge p_0$ the reverse decision is made. A correct
decision for all k distractors is made if simultaneously $x_i/n < p_0$ when $p_i < p_0$ and
$x_i/n \ge p_0$ when $p_i \ge p_0$ (i=1,...,k). The difficulty is that because of sampling
fluctuations, we might observe an $x_i$ that results in an incorrect decision. For
example, we might observe $x_i/n \ge p_0$ when in reality $p_i < p_0$. Accordingly, when using
Henrysson's procedure, we need to consider the following types of questions. How
many examinees should we sample to be reasonably certain of making a correct
decision for all k distractors regardless of the actual values of the $p_i$'s? This
type of question occurs when designing a study of a proposed item, i.e., prior
to collecting any data. In contrast, once data are available, one might
conduct a retrospective study and consider the probability of making a
correct sort of the distractors for the "typical" item under consideration.
Still another type of problem that might be considered is determining the
effect of rewriting or replacing a distractor. In the present context we
would want the new value of $p_i$, say $p_i'$, to be greater than $p_0$. At a minimum,
we want $p_i'$ to be at least as large as $p_i$. Thus, the question might arise as
to how certain we can be that $p_i'$ is less than or greater than $p_i$ based on
the number of examinees that are sampled. If $p_i' < p_i$, the original version
of the distractor should be used; if $p_i' \ge p_i$, the new version is described as
improving upon the old. The purpose of this paper is to provide an approach
to these problems.

From a statistical point of view this paper is concerned with comparing
multinomial cell probabilities to a standard and with comparing binomial
distributions to a control. For related results on this type of problem the
reader is referred to Gibbons, Olkin and Sobel (1977, Chapter 10), Fhaner
(1974), Huang (1975), Tong (1969) and Wilcox (1979a, 1979b).

2. Mathematical Statement of the Problem

For a random sample of n examinees (sampled from an infinite population
or a finite population with replacement) let $x = (x_1,\ldots,x_{k+1})$ be the
observed frequencies among the k+1 alternatives. The random vector x has a
multinomial distribution given by

$$f(x) = n! \prod_{i=1}^{k+1} \frac{p_i^{x_i}}{x_i!}$$

where $\sum x_i = n$ and $\sum p_i = 1$. Let $p_0$ be a known constant. The first goal is to
determine for each $p_i$ (i=1,...,k) whether $p_i$ is above or below $p_0$. As previously
indicated, the decision $p_i \ge p_0$ is made if $x_i/n \ge p_0$; otherwise the reverse is
said to be true. Let g, $0 \le g \le k$, be the number of $p_i$'s such that $p_i \ge p_0$ and
for convenience (and without loss of generality), suppose that the $p_i$'s (i=1,...,k)
are ordered, i.e., $p_1 \le p_2 \le \cdots \le p_k$. As already noted, in terms of the $x_i$'s, a
correct decision (CD) is made if simultaneously

$$x_i/n < p_0, \quad i=1,\ldots,k-g \tag{2.1}$$

and

$$x_i/n \ge p_0, \quad i=k-g+1,\ldots,k. \tag{2.2}$$

The problem is to find the smallest n, say $n_0$, so that regardless of the actual
values of the $p_i$'s, the probability of a correct decision has a value reasonably
close to one. More briefly, we want to find the smallest n so that

$$P(CD) \ge P^*, \tag{2.3}$$

where $2^{-k} < P^* < 1$.

Following Gibbons et al. (1977), an indifference zone formulation of
the problem is used. Thus, the investigator is assumed to have chosen a
constant $\delta^*$ with the idea that if $p_0-\delta^* < p_i < p_0+\delta^*$, there is negligible loss in
misclassifying the ith distractor. In fact, if the value of $p_i$ is in the open
interval $(p_0-\delta^*, p_0+\delta^*)$, any decision for that distractor is designated as
being correct and so a correct decision is made with probability one. Thus,
our only concern is with values of $p_i \le p_0-\delta^*$ and $p_i \ge p_0+\delta^*$.

3. An Exact Solution

In this section an exact solution to the problem of determining $n_0$ is
described. First we observe that the P(CD) is a function of the unknown $p_i$'s.
Thus, for a given n, it might be that P(CD) $\ge P^*$ for some values of the $p_i$'s
but not for others. To be certain that (2.3) holds for any vector $p=(p_1,\ldots,p_{k+1})$
we consider, as is typically done, the worst possible case, namely, the $p_i$
values, say $p^0=(p_1^0,\ldots,p_{k+1}^0)$, that minimize the P(CD). It is shown below
that $p^0$ does not depend on n. Hence, by choosing the smallest n so that
$P(CD \mid p=p^0) \ge P^*$, (2.3) is guaranteed regardless of the actual values of the $p_i$'s.
To avoid certain technical difficulties, it is assumed that $k(p_0-\delta^*)<1$. This
is not a serious restriction for the problem at hand since typically $p_0 \le .2$,
$.01 \le \delta^* \le .1$ and $k \le 4$.

Our immediate goal is to show that $p^0$ is given by $p_i^0=p_0-\delta^*$ (i=1,...,k-g)
and $p_i^0=p_0+\delta^*$ (i=k-g+1,...,k). First, however, some preliminary results are
needed. Accordingly, we begin by demonstrating that for fixed g and n,

$$P(x_i/n < p_0,\ i=1,\ldots,k-g) \tag{3.1}$$

is minimized when $p_1=p_2=\cdots=p_{k-g}=p_0-\delta^*$. Since by assumption $(k-g)(p_0-\delta^*)<1$,
the possibility of having $p_1=\cdots=p_{k-g}=p_0-\delta^*$ is ensured.




Let s be the smallest integer greater than or equal to $np_0$ and
let $\bar{p}_i = \sum_{j=i+1}^{k-g} p_j$. Olkin and Sobel (1965) show that (3.1) is equal to
a (k-g)-fold integral of the form

$$\frac{\Gamma(n+1)}{\Gamma^{k-g}(s)\,\Gamma(n-s_0+1)} \int \cdots \int (1-t_0)^{n-s_0} \prod_{i=1}^{k-g} t_i^{s-1}\,dt_i$$

[limits of integration illegible in the source scan]

where $\Gamma$ is the usual gamma function, $s_0=(k-g)s$, $t_0=\sum_{i=1}^{k-g} t_i$ and $t_0, t_1,
\ldots, t_{k-g}$ are dummy variables. Note that this quantity depends only
on $(p_1,\ldots,p_{k-g})$. Examination of the limits of this (k-g)-fold integral
reveals that among all vectors $(p_1,\ldots,p_{k-g})$ for which $p_i \le p_0-\delta^*$, (3.1)
attains its minimum value when

$$p_1=\cdots=p_{k-g}=p_0-\delta^* \tag{3.2}$$

as was to be shown.
Next consider

$$P(x_{k-g+1} \ge np_0, \ldots, x_k \ge np_0). \tag{3.3}$$

From Olkin and Sobel (1965) we see that this probability is equal to

$$\frac{\Gamma(n+1)}{\Gamma^{g}(s)\,\Gamma(n-gs+1)} \int_0^{p_{k-g+1}} \cdots \int_0^{p_k} (1-t_0)^{n-gs} \prod_{i=1}^{g} t_i^{s-1} \prod_{i=1}^{g} dt_i \tag{3.4}$$

where now $t_0 = \sum_{i=1}^{g} t_i$ and again $t_0, t_1, \ldots, t_g$ are dummy variables. From
(3.4) it follows that for fixed g and n, among all possible values of
$p_i \ge p_0+\delta^*$ (i=k-g+1,...,k), expression (3.3) is minimized when

$$p_{k-g+1}=\cdots=p_k=p_0+\delta^*. \tag{3.5}$$

The above results are now extended to show that for any n and any
admissible g,




$$P(CD) = P(x_1 < s, \ldots, x_{k-g} < s,\ x_{k-g+1} \ge s, \ldots, x_k \ge s)$$

is minimized when

$$p_1=\cdots=p_{k-g}=p_0-\delta^* \quad \text{and} \quad p_{k-g+1}=\cdots=p_k=p_0+\delta^*. \tag{3.6}$$

The vector p that satisfies the two conditions given by (3.6) is referred
to as the least favorable configuration of the $p_i$'s.

First note that

$$P(CD) = \sum \sum P(x_1,\ldots,x_{k-g})\, P(x_{k-g+1},\ldots,x_k \mid x_1,\ldots,x_{k-g}) \tag{3.7}$$

where the first summation is over all vectors $(x_1,\ldots,x_{k-g})$ such that
$x_i < s$ (i=1,...,k-g) and the second is over all vectors $(x_{k-g+1},\ldots,x_k)$
such that $x_j \ge s$ (j=k-g+1,...,k). It can be verified using standard
techniques that

$$P(x_{k-g+1},\ldots,x_k \mid x_1,\ldots,x_{k-g})$$

is a multinomial distribution given by

$$\frac{(n-x_1-\cdots-x_{k-g})!\; r_{k-g+1}^{x_{k-g+1}} \cdots r_k^{x_k}\, (1-p_1-\cdots-p_k)^{n-x_1-\cdots-x_k}}{x_{k-g+1}! \cdots x_k!\,(n-x_1-\cdots-x_k)!\,(1-p_1-\cdots-p_{k-g})^{n-x_1-\cdots-x_k}}$$

where $r_i = p_i/(1-p_1-\cdots-p_{k-g})$.

Thus, making the appropriate modification in (3.4) or referring to Olkin
and Sobel (1965), the second summation in (3.7) can be written as

$$\frac{\Gamma(n-x_1-\cdots-x_{k-g}+1)}{\Gamma^{g}(s)\,\Gamma(n-x_1-\cdots-x_{k-g}-gs+1)} \int_0^{r_{k-g+1}} \cdots \int_0^{r_k} (1-t_0)^{n-x_1-\cdots-x_{k-g}-gs} \prod_{j=k-g+1}^{k} t_j^{s-1} \prod_{j=k-g+1}^{k} dt_j$$

where $t_0 = \sum_{j=k-g+1}^{k} t_j$ and again $t_0, t_{k-g+1}, \ldots, t_k$ are dummy variables.
Examination of the limits of this g-fold integral reveals that for fixed
$x_1,\ldots,x_{k-g}$, $p_1,\ldots,p_{k-g}$, the second summation in (3.7) is minimized when
$p_{k-g+1}=\cdots=p_k=p_0+\delta^*$. This in turn implies that for fixed $p_1,\ldots,p_{k-g}$ the
P(CD) as given by (3.7) is minimized when (3.5) holds.

Next, set $p_{k-g+1}=\cdots=p_k=p_0+\delta^*$. Since by assumption it is possible
to have $p_1=\cdots=p_{k-g}=p_0-\delta^*$, it follows, using an argument similar to the
one in the preceding paragraph, that the P(CD) is minimized when (3.6)
holds. Hence, by choosing $n_0$ to be the smallest integer such that the
P(CD) $\ge P^*$ for all admissible values of g under the least favorable
configuration (3.6), we guarantee (2.3) no matter what the values of the $p_i$'s
happen to be.

Exact and Approximate Methods for Calculating $n_0$

Tables 1-3 give the value of $n_0$ for $p_0=.1$, $\delta^*=.05$; $p_0=.15, .2$,
$\delta^*=.05, .1$; $P^*=.77, .9, .95, .99$; and k=1(1)3. If, for example,
k=2, $p_0=.1$, $\delta^*=.05$ and $P^*=.9$, n=110 examinees guarantees that the correct
sort of the two distractors will be made with probability at least .9
regardless of the actual $p_i$ values. This section describes exact and
approximate methods for determining $n_0$.




A Lower Bound to $n_0$

There might be occasions where it is helpful to have a lower bound
to $n_0$ that is easily computed. Accordingly, let I(k,p,s,n) represent
the value of (3.4) when $p_1,\ldots,p_k$ have a common value p. This is
also the P(CD) for the least favorable configuration when g=k. It can be
seen that the smallest n, say $n_1$, such that $I(k,p,s,n_1) \ge P^*$ is a lower bound
to $n_0$ whenever $p \ge p_0+\delta^*$. Sobel, Uppuluri and Frankowski (1977) have
tabled the values of I(k,p,v,n) for $p=t^{-1}$, t=k+1(1)10, and v=1(1)10 which
can be used to determine a lower bound to $n_0$ by referring to the entries
for the smallest $p \ge p_0+\delta^*$ and the largest $v \le s$. For example, if k=2,
$p_0=.1$, $\delta^*=.05$, $P^*=.9$, then the smallest $p \ge p_0+\delta^*=.15$ in their Table
B is p=1/6. Examination of the entries in their table reveals that with
n=68 (which implies that s=7) I(2, 1/6, 7, 68) = .9008 and so $n_0 \ge 68$.
Thus, for this particular case, $n_0$ can be determined exactly by starting
with n=68, evaluating the P(CD) for g=0, 1, 2 and checking whether P(CD) $\ge
P^*$ for all three values of g. If P(CD) $< P^*$ for any g, the value of n is
increased by one and the process repeated until (2.3) is attained for all
three values of g.

Method of Calculating $n_0$ for the Case k=1

We first discuss the determination of $n_0$ for the special case k=1.
This situation has already been considered by Fhaner (1974) and Wilcox
(1979). In particular, $n_0$ is the smallest integer n so that simultaneously

$$\sum_{x=s}^{n} \binom{n}{x} (p_0+\delta^*)^x (1-p_0-\delta^*)^{n-x} \ge P^*$$

and

$$\sum_{x=0}^{s-1} \binom{n}{x} (p_0-\delta^*)^x (1-p_0+\delta^*)^{n-x} \ge P^*.$$

These two quantities are fairly inexpensive to evaluate on a computer,
even for n > 500. They can also be calculated via the relationship

$$I(1,p,s,n) = \sum_{x=s}^{n} \binom{n}{x} p^x (1-p)^{n-x} \tag{3.8}$$

where I(1,p,s,n) is the usual incomplete beta function. It has also been
shown that an approximate value of $n_0$ is given by $\lambda^2 p_0(1-p_0)/(\delta^*)^2$ where $\lambda$
is the $P^*$ quantile of the standard normal distribution.
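
A minimal sketch of the k=1 calculation follows, assuming the two binomial tail conditions are checked by direct summation and n is increased from 1 until both hold. The function names and the search limit `n_max` are hypothetical. The cutoff s is computed after rounding $np_0$, because a product such as 110 x 0.1 is not exactly 11 in floating point.

```python
import math
from statistics import NormalDist


def cutoff(n, p0):
    """Smallest integer s >= n*p0, rounding first to avoid float noise."""
    return math.ceil(round(n * p0, 9))


def upper_tail(n, s, p):
    """P(X >= s) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(s, n + 1))


def n0_single_distractor(p0, delta, p_star, n_max=1000):
    """Smallest n for which both k=1 conditions hold (linear search).

    n_max is kept at or below 1000 so that the binomial coefficients
    can still be converted to floats.
    """
    for n in range(1, n_max + 1):
        s = cutoff(n, p0)
        keeps_good = upper_tail(n, s, p0 + delta) >= p_star
        flags_poor = 1.0 - upper_tail(n, s, p0 - delta) >= p_star
        if keeps_good and flags_poor:
            return n
    raise ValueError("no n up to n_max satisfies both conditions")


def n0_normal_approx(p0, delta, p_star):
    """Approximation lambda^2 * p0 * (1 - p0) / delta^2."""
    lam = NormalDist().inv_cdf(p_star)
    return lam**2 * p0 * (1 - p0) / delta**2
```

For $p_0=.1$, $\delta^*=.05$ and $P^*=.9$ the approximation evaluates to about 59.1. The exact search can return a different value because the cutoff s changes in steps as n grows.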

The Case k=2

For k=2 there are three values of g that need to be considered. For
g=2, the minimum P(CD) is given by I(2,p,s,n) with $p=p_0+\delta^*$. From Sobel
et al. (1977, p. 8)

$$I(2,p,s,n) = \sum_{y=2s}^{n} \binom{n}{y} p^y (1-2p)^{n-y} \left[\sum_{z=s}^{y-s} \binom{y}{z}\right].$$

(It might appear that the term $p^y(1-2p)^{n-y}$ should be either $(2p)^y(1-2p)^{n-y}$
or $p^y(1-p)^{n-y}$, but from Sobel et al. it can be seen that this expression
is correct.) For g=0, the minimum P(CD) is given by

$$J(k,p,s,n) = \sum_{y=0}^{k} (-1)^y \binom{k}{y} I(y,p,s,n) \tag{3.9}$$

with $p=p_0-\delta^*$ where I(0,p,s,n)=1. In fact, from Sobel and Uppuluri (1974), it
follows that for g=0, the minimum P(CD) is given by (3.9) for any k. For
g=1, the minimum P(CD) is

$$\sum_{x_1=0}^{s-1} \binom{n}{x_1} (p_0-\delta^*)^{x_1} (1-p_0+\delta^*)^{n-x_1}\, I\big(1, (p_0+\delta^*)/(1-p_0+\delta^*), s, n-x_1\big).$$

The last expression is obtained by writing the P(CD) as is done in (3.7).
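
The three k=2 expressions can be evaluated directly, as in the sketch below. It assumes the least favorable configuration, with probability $p_0-\delta^*$ for each distractor that should be flagged and $p_0+\delta^*$ for each that should be kept. All names are hypothetical, and the default `n_max` is kept moderate because the integer terms grow roughly like $3^n$ before they are converted to floats.

```python
import math


def cutoff(n, p0):
    """Smallest integer s >= n*p0, rounding first to avoid float noise."""
    return math.ceil(round(n * p0, 9))


def upper_tail(p, s, n):
    """I(1,p,s,n) = P(X >= s) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(s, n + 1))


def both_at_least(p, s, n):
    """I(2,p,s,n): two cells with common probability p both reach s."""
    total = 0.0
    for y in range(2 * s, n + 1):
        ways = sum(math.comb(y, z) for z in range(s, y - s + 1))
        total += math.comb(n, y) * ways * p**y * (1 - 2 * p) ** (n - y)
    return total


def pcd_least_favorable(n, p0, delta, g):
    """P(CD) for k=2 under the least favorable configuration, g in {0, 1, 2}."""
    s = cutoff(n, p0)
    lo, hi = p0 - delta, p0 + delta
    if g == 2:
        return both_at_least(hi, s, n)
    if g == 0:
        return 1.0 - 2.0 * upper_tail(lo, s, n) + both_at_least(lo, s, n)
    r = hi / (1.0 - lo)  # conditional probability of the second distractor
    return sum(math.comb(n, x) * lo**x * (1 - lo) ** (n - x)
               * upper_tail(r, s, n - x) for x in range(s))


def smallest_n(p0, delta, p_star, start=1, n_max=400):
    """First n >= start with P(CD) >= p_star for g = 0, 1 and 2."""
    for n in range(start, n_max + 1):
        if all(pcd_least_favorable(n, p0, delta, g) >= p_star for g in (0, 1, 2)):
            return n
    raise ValueError("no n up to n_max meets the requirement")
```

A call such as `smallest_n(0.1, 0.05, 0.9, start=68)` mirrors the search strategy described in the text, starting from the tabled lower bound.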

An Approximate Solution for k > 1

For k > 2, the necessary calculations to compute $n_0$ become prohibitively
expensive. In many cases, however, exact results are possible by first
applying the approximate solution about to be described and then performing
the calculations outlined below.

The proposed approximate solution is based on the Bonferroni inequality
which states that for any set of events $B_1,\ldots,B_m$,

$$P\Big(\bigcap_i B_i\Big) \ge 1 - \sum_i P(B_i^c) \tag{3.10}$$

where $B_i^c$ is the complement of the event $B_i$. Several other approximate
solutions were investigated that relied on the central limit theorem
and various inequalities for the multivariate normal distribution. However,
the procedure proposed here is relatively easy to use, it is inexpensive,
and it is surprisingly accurate.

Familiarity with the multinomial distribution suggests that when $p_0$
is close to zero, as is typically the case for the problem under
investigation, the P(CD) is a minimum when g=k. Conditions under which this is true
are not known. In all cases considered, however, it was verified that this
is indeed the case. Fortunately it is possible to arrive at this
conclusion for the special cases considered here without calculating the exact
value of the P(CD) for every g. This point is illustrated below.

Let $n_1$ be the smallest integer such that (3.8) is greater than or
equal to $T^*$ where $T^*=1-(1-P^*)/k$. We consider $n_1$ as a first approximation
to $n_0$. As alluded to earlier, our main motivation for using $n_1$ to
approximate $n_0$ is the high cost of determining $n_0$ exactly for $k \ge 3$. Before
considering this case, it is of interest to examine the accuracy of the
approximation for k=2.

Table 4 gives the value of $n_1$ for k=2 and the values of $P^*$ and $\delta^*$
used in Table 2. As can be seen, $n_1$ gives a good approximation to $n_0$.

The Case k=3

The first step used to determine $n_0$ exactly for the case k=3 was to
compute $n_1$ in the manner described in the previous section. The results
are reported in Table 5. Next, using the value of $n_1$ the value of $I(3,p_0+\delta^*,
s,n_1)$ was calculated. This was accomplished with the reduction formula

$$I(k,p,s,n) = \sum_{y=(k-1)s}^{n-s} \binom{n}{y} (1-p)^y p^{n-y}\, I(k-1, p/(1-p), s, y) \tag{3.11}$$

given by Sobel et al. (1977, p. 8). The value of $n_1$ was then adjusted to
find the smallest value of n, say $n_2$, so that $I(k,p_0+\delta^*,s,n_2) \ge P^*$. A
comparison of Table 5 with Table 3 shows that frequently $n_0=n_2$ and that typically
the value of $n_1$ is within one of the value $n_0$.
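
The reduction formula lends itself to a short recursive implementation. The sketch below conditions on the count in one cell, so each level of the recursion removes one distractor and rescales the common probability to p/(1-p). Together with the alternating sum that defines J(k,p,s,n), it gives both quantities needed for the checks that follow. The function names are hypothetical, and memoization with `lru_cache` is an implementation convenience not mentioned in the paper.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def i_prob(k, p, s, n):
    """I(k,p,s,n): probability that k cells with common probability p
    each receive at least s of n observations (requires k*p <= 1)."""
    if k == 0:
        return 1.0
    total = 0.0
    # y = observations falling outside the cell being conditioned on
    for y in range((k - 1) * s, n - s + 1):
        total += (math.comb(n, y) * (1 - p) ** y * p ** (n - y)
                  * i_prob(k - 1, p / (1 - p), s, y))
    return total


def j_prob(k, p, s, n):
    """J(k,p,s,n): probability that all k cells receive fewer than s."""
    return sum((-1) ** y * math.comb(k, y) * i_prob(y, p, s, n)
               for y in range(k + 1))
```

For k=1 the recursion reduces to the ordinary binomial upper tail, so the same function covers every case used in this section.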

Finally, to verify that $n_2$ is sufficiently large to satisfy (2.3),
i.e., that $n_0=n_2$, we calculated $I(i,p_0+\delta^*,s,n_2)$ for i=1,2 and $J(i,p_0-\delta^*,s,n_2)$
for i=1,2,3. As previously pointed out, $J(i,p_0-\delta^*,s,n_2)$ is the probability
of correctly classifying the i distractors having probability $p=p_0-\delta^*$ of
being chosen by a randomly selected examinee. These values were then used
in conjunction with the Bonferroni inequality to show that P(CD) $\ge P^*$.

As an illustration, consider the case k=3, $p_0=.1$, $\delta^*=.05$ and $P^*=.95$.
The value of $n_1$ was found to be 199 and it was verified via (3.11) that
n=199 is the smallest sample size so that P(CD) $\ge P^*$ when g=k (all
distractors have a probability of being chosen by a typical examinee that is greater
than the standard $p_0$). Consider, for example, the case g=1. It was found
that $I(1,p_0+\delta^*,s,199)=.996$ and that $J(2,p_0-\delta^*,s,199)=.995$. As explained
earlier, the first quantity is the probability of making a correct decision
for a distractor having $p=p_0+\delta^*$ and the second quantity is the probability
of a correct decision for two distractors having $p=p_0-\delta^*$. Applying (3.10)
it follows that the joint probability of correctly classifying all three
distractors is greater than or equal to 1-(1-.996)-(1-.995)=.991. Thus,
the desired probability guarantee is satisfied for this special case.
Proceeding in a similar manner, it can be seen that n=199 is sufficiently large
for g=2 as well.
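
The Bonferroni step used in this check is simple enough to state as code. The sketch below, with hypothetical function names, combines the separate probabilities of correct classification into a lower bound on the joint probability, and also computes the per-distractor target $T^*=1-(1-P^*)/k$ used for the first approximation.

```python
def bonferroni_lower_bound(probabilities):
    """Lower bound on the probability that all events occur,
    given the probability of each event separately."""
    return 1.0 - sum(1.0 - q for q in probabilities)


def bonferroni_target(p_star, k):
    """Per-distractor target T* = 1 - (1 - P*)/k."""
    return 1.0 - (1.0 - p_star) / k
```

With the two values quoted in the illustration, `bonferroni_lower_bound([0.996, 0.995])` reproduces the figure .991, and `bonferroni_target(0.95, 3)` is about .9833.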

The Case k=4

The last situation considered is k=4. In this case the value of $n_0$
was approximated in the manner previously described, but no attempt was made
to make an exact evaluation of the P(CD) under the least favorable
configuration. However, checks were made on the adequacy of $n_0$ with a normal
approximation to I(k,p,s,n), denoted $A_k(p,h)$ and given by equation (3.12).

[Equation (3.12) is illegible in the source scan. It expresses $A_k(p,h)$ as a sum of products of powers of $\Phi(h)$ and $\phi(h)$, with coefficients involving a quantity $\rho$ whose definition is also illegible.]

Here $h = 2\big(\arcsin\sqrt{p} - \arcsin[s/(n+1)]^{1/2}\big)(n+2)^{1/2}$, $\Phi$ is the
standard normal cumulative distribution function and $\phi$ is the standard normal
density function. This approximation was proposed by Sobel et al. (1977, section
2.4) who claim that it generally gives better results than the normal
approximation to the discrete multinomial distribution.

Table 6 gives the resulting values of $n_0$ for k=4. Using (3.12) in
conjunction with the Bonferroni inequality, an approximate lower bound to
the P(CD) was also determined for each $n_0$. These values are reported
[remainder of sentence missing in the source scan]


4. A Lower Bound to the P(CD) for a Typical Item

In this section we describe how a retrospective study might be conducted
to estimate a lower bound to the P(CD) for a typical item under study.
Before doing so, we note that once observations are available it is also
of interest to obtain a point estimate of the P(CD) for a typical item and
that under certain circumstances a theoretical solution to this problem
exists. For example, we might assume that $p_1,\ldots,p_k$ arise from a Dirichlet
distribution the parameters of which can be estimated in the manner described
by Mosimann (1962). However, there remains the practical difficulty of
evaluating the P(CD) once an expression for it has been obtained. For this
reason we do not discuss this problem further.

Although there are difficulties with obtaining a point estimate of
the P(CD) for a typical test item, it is fairly easy to obtain a lower bound
to this quantity by proceeding in the manner about to be described. It
is assumed that observations are available on N items under investigation.
Consider the first distractor of every item having probability $p_{1j}$ (j=1,...,N)
of being chosen by a typical examinee. Let h(p) be the marginal distribution
of $p_{1j}$. No assumption is made about the form of h; it is merely assumed
that the first two moments of h exist. Assuming the conditional distribution
of $x_{1j}$ is binomial for a given $p_{1j}$, we can estimate the mean and variance
of $p_{1j}$ over the domain of items, say $\mu$ and $\sigma^2$, with

$$\hat\mu = (Nn)^{-1} \sum_{j=1}^{N} x_{1j}$$

and an estimate $\hat\sigma^2$ of $\sigma^2$ (Lord and Novick, 1968, p. 521).

[The formula for $\hat\sigma^2$ is illegible in the source scan.]

Henceforth we assume $\mu$ and $\sigma^2$ are known. Let

$$\xi = \begin{cases} p_0, & \text{if } \mu < p_0 \\ \mu, & \text{if } p_0 \le \mu \le 1 \end{cases}$$

and let U be defined in two cases: for $0 < \sigma^2 < m$ by an expression that is
illegible in the source scan, and otherwise by

$$U = \frac{\mu(1-\mu)-\sigma^2}{(1-p_0)\,p_0},$$

where

$$m = \max\{\mu(p_0-\mu),\ (\mu-p_0)(1-\mu)\}.$$

Let $\beta_1$ be the probability of a false-negative decision for the first
distractor of a randomly chosen item, i.e., $\beta_1 = P(x_1 < s,\ p_1 \ge p_0)$.
Using results given by Skibinsky (1977), Wilcox (1979c) shows that $\beta_1$ is
bounded above by a quantity involving U.

[The displayed bound on $\beta_1$ is illegible in the source scan.]

The details of the argument are given by Wilcox and so they need not be
repeated here. Let $\alpha_1 = P(x_1 \ge s,\ p_1 < p_0)$. It can also be shown that
$\alpha_1$ is bounded above by a quantity involving $U'$ and binomial terms with
success probability $p_0$,

[The displayed bound on $\alpha_1$ is illegible in the source scan.]

where $U'$ is the value of U with $\xi$ replaced with

$$\xi_1 = \begin{cases} \mu, & \text{if } \mu < p_0 \\ p_0, & \text{if } p_0 \le \mu \le 1. \end{cases}$$

Using the above procedure, we obtain an upper bound to $\alpha_i$ and $\beta_i$, say
$\hat\alpha_i$ and $\hat\beta_i$, for i=1,...,k. From the Bonferroni inequality it follows that

$$P(CD) \ge 1 - \sum_{i=1}^{k} (\hat\alpha_i + \hat\beta_i).$$

It is also of interest to note that a lower bound to the P(CD) can be
determined for a given $\delta^*>0$. The interested reader is referred to Wilcox (1979c).

5. Comparing Two Binomial Probability Functions

As pointed out in the introduction to this paper, there may be situations
where an investigator is interested in ascertaining the effect of a particular
modification to a multiple-choice test item under study. It was further
suggested that this problem might be formulated in terms of comparing a binomial
probability function to a control. That is, there are two binomial probability
functions having probability of success $p'$ and p and the goal is to determine
whether $p' \le p$. A solution to this problem is given by Wilcox (1979b). Here
we extend this solution to cases where we want to determine whether $p' < p +
c$ where c is a constant specified in advance by the investigator as being
appropriate for the situation at hand. In other words, we want to determine
whether the difference between $p'$ and p is reasonably large.


Let x and y be the observed number of successes corresponding to the
populations having probability of success $p'$ and p, respectively. The
decision $p' < p+c$ is made if $n^{-1}x < n^{-1}y + c$; otherwise the reverse is said to be
true.

As before, an indifference zone formulation of the problem is used.
In this case the indifference zone consists of the open interval $(p+c-\delta^*,
p+c+\delta^*)$. If $p+c-\delta^* < p' < p+c+\delta^*$ the investigator is not particularly concerned
about which decision is made. If $p' \le p+c-\delta^*$ or if $p' \ge p+c+\delta^*$ we want
the probability of a correct decision to be reasonably high.

Since the family of binomial probability functions has the monotone
likelihood ratio property, it can be seen that for fixed p, the minimum
P(CD) is

$$\sum_{y=0}^{n} \sum_{x=0}^{y-1+[nc]} \binom{n}{x} (p+c-\delta^*)^x (1-p-c+\delta^*)^{n-x} \binom{n}{y} p^y (1-p)^{n-y} \tag{5.1}$$

or

$$\sum_{y=0}^{n} \sum_{x=y+[nc]}^{n} \binom{n}{x} (p+c+\delta^*)^x (1-p-c-\delta^*)^{n-x} \binom{n}{y} p^y (1-p)^{n-y}, \tag{5.2}$$

whichever is smaller, where [nc] represents the largest integer less than
or equal to nc. Thus, to guarantee that both (5.1) and (5.2) have a value
exceeding $P^*$, it is sufficient to minimize these quantities as a function
of p and see whether the desired condition holds for a given n. If, after
minimization, either (5.1) or (5.2) is less than $P^*$, a larger value of n
must be used. Table 8 gives the smallest required sample sizes for $P^*=.75$,
.9, .95, .99; $\delta^*=.1$ and c=0, .05, .1, .15.




Concluding Remarks

The main result in this paper is that a researcher can solve the following
type of problem. Suppose we have a multiple-choice test item with k=4
distractors. Further suppose we want to determine which distractors have
a probability of less than .1 of being chosen by a typical examinee, and
simultaneously determine which have a probability of at least .1. A decision
about each distractor is made based on a random sample of examinees.
If the proportion of examinees choosing a distractor is less than .1, we
decide the corresponding cell probability is less than .1; otherwise the
reverse decision is made. What is the minimum number of examinees required
so that, regardless of the actual cell probabilities, a correct sort of the
distractors is made with probability at least .975 when an indifference zone
of $\delta^*$=.05 is used? From Table 7, the answer is n=269. If instead
there are k=2 distractors, Table 2 says that at least n=235 examinees would
be needed.
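The decision rule just described can be sketched in a few lines of Python.
This is only an illustration: the function name `sort_distractors`, the
labels it returns, and the example counts are hypothetical and are not taken
from the paper; only the standard .1 and the sample size n=269 come from the
text.

```python
def sort_distractors(counts, n, p0=0.1):
    """Label each distractor by comparing its observed proportion with p0.

    counts -- number of examinees (out of n) choosing each distractor
    p0     -- the standard against which each cell probability is compared
    """
    labels = []
    for count in counts:
        proportion = count / n
        # Decide "below p0" when the observed proportion is less than p0;
        # otherwise make the reverse decision.
        labels.append("below" if proportion < p0 else "at_least")
    return labels


# Hypothetical counts for k=4 distractors observed on n=269 examinees.
example = sort_distractors([10, 40, 27, 26], n=269)
print(example)
```

The tables in the paper address the separate question of how large n must be
for this sort to be correct with a specified probability; the sketch only
applies the rule once n has been chosen.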

While the original motivation for this paper was to analyze distractors,
an additional application of the results reported here recently came to the
author's attention. Macready and Dayton (1977) illustrate how latent
structure models might be used to measure achievement. For the simplest case,
we have two equivalent items for measuring a particular skill. Two items
are defined to be equivalent if every examinee knows the answer to both or
neither one. Let $\zeta$ be the proportion of examinees who have acquired the
skill, and let $\beta_i$ = P(correct on the ith item | examinee does not
know), i=1,2. For a randomly selected examinee, the probability of a correct
on the first item and an incorrect on the second is

$p_{10} = \beta_1 (1-\beta_2)(1-\zeta)$

and the probability of an incorrect and then a correct is

$p_{01} = (1-\beta_1)\beta_2 (1-\zeta)$.

If we assume $\beta_1 = \beta_2 < 1/2$, then $p_{10} < 1/4$ and
$p_{01} < 1/4$. Since $p_{10}$ and $p_{01}$ are cell probabilities of a
multinomial distribution, a partial check on the model can be made by
estimating $p_{10}$ and $p_{01}$ in the usual manner, and seeing whether the
values are both less than 1/4. Determining the number of examinees required
can be accomplished with the results given in this paper.






REFERENCES

Fhanér, S. Item sampling and decision making in achievement testing.
    British Journal of Mathematical and Statistical Psychology,
    1974, 27, 172-175.

Gibbons, J., Olkin, I., and Sobel, M. Selecting and ordering
    populations: A new statistical methodology. New York:
    John Wiley, 1977.

Henrysson, S. Gathering, analyzing, and using data on test items.
    In R. L. Thorndike (Ed.), Educational Measurement. Washington,
    D.C.: American Council on Education, 1971.

Huang, W. Bayes approach to a problem of partitioning k normal
    populations. Bulletin of the Institute of Mathematics Academia
    Sinica, 1975, 3, 87-97.

Keats, J. A. and Lord, F. M. A theoretical distribution for mental
    test scores. Psychometrika, 1962, 27, 59-72.

Lord, F. M. and Novick, M. R. Statistical theories of mental test
    scores. Reading, Mass.: Addison-Wesley, 1968.

Macready, G. B. and Dayton, C. M. The use of probabilistic models in
    the assessment of mastery. Journal of Educational Statistics, 1977,
    2, 99-120.

Mosimann, J. E. On the compound multinomial distribution, the
    multivariate β-distribution, and correlations among proportions.
    Biometrika, 1962, 49, 65-82.

Olkin, I. and Sobel, M. Integral expressions for tail probabilities
    of the multinomial and negative multinomial distributions.
    Biometrika, 1965, 52, 167-179.

Skibinsky, M. The maximum probability on an interval when the mean and
    variance are known. Sankhya, 1977, Series A, 39, 144-159.

Sobel, M., Uppuluri, V. R. R. and Frankowski, K. Selected tables in
    mathematical statistics, Volume IV. Providence, Rhode Island:
    American Mathematical Society, 1977.

Sobel, M. and Uppuluri, V. R. R. Sparse and crowded cells and Dirichlet
    distributions. The Annals of Statistics, 1974, 2, 977-987.

Tong, Y. L. On partitioning a set of normal populations by their
    locations with respect to a control. Annals of Mathematical
    Statistics, 1969, 40, 1300-1324.

Wilcox, R. R. Comparing examinees to a control. Psychometrika, 1979,
    44, 55-68 (a).

Wilcox, R. R. Applying ranking and selection techniques to determine
    the length of a mastery test. Educational and Psychological
    Measurement, 1979, 39, 13-22 (b).

Wilcox, R. R. On false-positive and false-negative decisions with a
    mastery test. Journal of Educational Statistics, 1979, 4, 59-73 (c).





TABLE 1

Values of n_0 for k=1

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         15     60    110    160    239
 .15    .05         25     86    153    219    313
 .15    .10          5     20     40     59     86
 .2     .05         33    109    180    260    370
 .2     .10          9     25     45     70    100




TABLE 2

Values of n_0 for k=2

 p_0    δ*     P*: .75    .90    .95   .975    .99

 .10    .05         49    110    160    235    290
 .15    .05         66    153    219    292    380
 .15    .10         19     40     59     79    106
 .20    .05         84    180    260    340    455
 .20    .10         24     45     70     90    120






TABLE 3

Values of n_0 for k=3

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         70    140    199    250*   320*
 .15    .05        100    192    259    333*   420*
 .15    .10         26     52     72     92    119
 .2     .05        120    225    305    390*   495*
 .2     .10         30     60     80    105    135

Entries marked with an * were not verified using exact
calculations of the P(CD).



TABLE 4

Values of n_1 for k=2

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         49    110    160    235    290
 .15    .05         66    153    219    292    380
 .15    .10         19     40     59     79    106
 .2     .05         89    180    260    345    460
 .2     .10         24     45     70     90    120







TABLE 5

Values of n for k=3

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         79    140    199    250    320
 .15    .05        100    193    260    333    420
 .15    .10         26     52     72     93    119
 .2     .05        120    225    305    ***
 .2     .10         30     60     80    105    135






TABLE 6

Approximate Values of n_0 for k=4

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         99    160    219    270    349
 .15    .05        132    219    292    360    446
 .15    .10         33     59     79     99    126
 .20    .05        155    260    345    425    525
 .20    .10         39     64     85    105    135






TABLE 7

Values of n for k=4

 p_0    δ*     P*: .75     .9    .95   .975    .99

 .1     .05         99    160    219    269    348
 .15    .05        132    219    292    359    453
 .15    .10         33     59     79     99    125
 .20    .05        155    260    345    425    540






TABLE 8

Values of n for comparing a binomial
distribution to a control, δ*=.1

 c      P*: .75     .9    .95    .99

 0           32     91    144    245
 .05         41    101    161    261
 .10         41    101    151    261
 .15         34     94    141    254






To appear in Applied Psychological Measurement 



SOLVING MEASUREMENT PROBLEMS WITH AN
ANSWER-UNTIL-CORRECT SCORING PROCEDURE



Rand R. Wilcox 



CENTER FOR THE STUDY OF EVALUATION 
Graduate School of Education 
University of California, Los Angeles

and the 

DEPARTMENT OF PSYCHOLOGY 
University of Southern California 



The project presented or reported herein was performed pursuant to a 
grant from the National Institute of Education, Department of Health, 
Education, and Welfare. However, the opinions expressed herein do not 
necessarily reflect the position or policy of the National Institute of 
Education, and no official endorsement by the National Institute of 
Education should be inferred.



Solving Measurement Problems

ABSTRACT

Answer-until-correct (AUC) tests have been with us for some time.
Pressey (1950) points to their advantages in enhancing learning, and Brown
(1965) has proposed a scoring procedure for them that appears to increase
reliability (Gilman and Ferry, 1972; Hanna, 1975). This paper describes
a new scoring procedure for AUC tests that solves various measurement
problems. In particular, it makes it possible to check whether guessing
is at random, it gives a measure of how "far away" guessing is from being
random, it corrects observed test scores for partial information, and it
yields a measure of how well an item reveals whether an examinee knows
or does not know the correct response. In addition, the paper derives
the optimal linear estimate (under squared error loss) of true score that
is corrected for partial information, and it derives another formula
score under the assumption the Dirichlet-multinomial model holds. Once
certain parameters are estimated, the latter formula score makes it
possible to correct for partial information using only the examinee's usual
number correct observed score. The importance of this formula score is
discussed at the end of the paper. Finally, various statistical techniques
are described that can be used to check the assumptions underlying the
proposed scoring procedure.







INTRODUCTION

When an examinee responds to a multiple-choice test item, there is
the problem that an examinee's response might not reflect his/her true
state. The most obvious example, and the one of central concern here, is
that an examinee might guess the correct response without knowing what
it really is. The common solution to this problem is to assume guessing
is at random. That is, if there are t alternatives from which to choose,
and only one is correct, the probability of a correct response when the
examinee does not know is $t^{-1}$. Simultaneously, however, it is recognized
that to assume random guessing is indefensible. One possibility is that
an examinee might be able to eliminate one or more distractors without
knowing the correct response. In support of this possibility are empirical
investigations on formula scoring where it was found that the probability
of guessing is substantially higher than would be expected when
random guessing occurs (Bliss, 1980; Cross and Frary, 1977). We might
assume guessing is at random anyway, but this can have serious consequences
in terms of test accuracy (e.g., Weitzman, 1970; Wilcox, 1980).

The purpose of this paper is to examine how an answer-until-correct
(AUC) testing procedure might be used to take into account the effects
of guessing. One advantage of the proposed scoring procedure is that its
efficacy can be empirically checked in several different ways. The model
contains number-right scoring, as well as the assumption of random guessing,
as special cases. Thus, when observed test scores suggest that the model
holds, the appropriateness of the two more common scoring procedures can
be checked, as is illustrated in a later section of the paper. On a related
matter, the model can be used to test whether items are "ideal" in the
sense defined by Weitzman (1970). This just means that a random guessing
assumption can be tested. Using the entropy function, it is also possible
to measure how "close" the probability of guessing is to $t^{-1}$. This is
important because when the probability is not close to $t^{-1}$, this suggests
it might be possible to improve the distractors, which in turn will improve
test accuracy. The exact sense in which this is true is explained below.
Another advantage of the model is that it yields a measure of test accuracy
that is not ordinarily available. Two new formula scores are also derived,
the advantages and disadvantages of which are discussed below.

It should be noted that a scoring rule for an AUC test has been proposed
by Brown (1965). The scoring rule has been empirically investigated
by Gilman and Ferry (1972) and Hanna (1975), who found it to be more
reliable than number correct scoring. Moreover, an AUC testing procedure
has been advocated from the standpoint of enhancing learning (Pressey,
1950). The goal in this paper is to propose a different scoring rule that
corrects for partial information.

ASSUMPTIONS

It is assumed that when an examinee responds to an achievement test
item, he/she can be described as either knowing or not knowing the correct
response. In the terminology of Reulecke (1977) this means that the
model includes a binary structure variable, or following Harris and Pearlman
(1978), examinees are described in terms of a dichotomized latent trait.
One more possibility is to say that an examinee either has or has not
acquired the "psychological structure" of a task (Spada, 1977). This
means that the model is deterministic in the sense that if an examinee's
latent state is known, and if there are no errors at the item level, it
would be known whether an examinee would produce a correct response.
However, the model includes what Reulecke (1977) calls an intensity variable.
In particular, it is assumed that an examinee who does not know might give
a correct response. The probability of this event is unknown, but it can
be estimated with the scoring formula and probability model described
below.

Following Horst (1933), it is assumed that when an examinee does not
know, he/she can eliminate at most t-2 distractors from consideration.
Once these distractors are eliminated, the examinee chooses an answer at
random from among those that remain. An examinee who knows always gives
the correct response.

Finally, an answer-until-correct scoring procedure is assumed. This
means that an examinee responds to a test item until the correct
alternative is chosen.

THREE TYPES OF GUESSING

Before turning to the new results, it is important to be more precise
about what is meant by guessing. Three types can be described. The first,
Type I guessing, applies to a situation where randomly sampled examinees
respond to the same multiple-choice item. In this case we define guessing
as the probability of a correct response given that the randomly sampled
examinee does not know. The second, or Type II guessing, is defined in
terms of a single examinee responding to an item randomly sampled from some
item domain. The rate of guessing for the examinee is the probability of a
correct response to a randomly sampled item that he/she does not know.
Finally, there is Type III guessing, which is the probability of a correct
response, over independent repeated trials, where a single examinee responds
to a specific item he/she does not know. Wilcox (1977a) examines some latent
structure models that are relevant to this case, but there are some
practical difficulties (Wilcox, 1979) which limit their use. Only Type I
and Type II guessing are considered here.

A MODEL FOR AUC TESTS AND TYPE I GUESSING

Consider a randomly sampled examinee responding to a specific test
item using an AUC test. For convenience, particular attention is given
to the case where the multiple-choice test item has t=4 alternatives from
which to choose, one of which is correct. The results are readily extended
to any value of t. Based on the above assumptions, the examinee
belongs to one of t=4 mutually exclusive groups. In particular, the examinee
knows the correct response, or can eliminate 0, 1, or 2 distractors.
Let $\zeta$ be the proportion of examinees who know, and let $\zeta_i$ be the
proportion of examinees who can eliminate i distractors. The probability of
a correct response the first time a randomly selected examinee chooses an
alternative is

Insert Equation 1 here

The probability of an incorrect on the first choice and a correct on the
second is

Insert Equation 2 here

The probability of two misses and then a correct is

Insert Equation 3 here

and the probability of three incorrects is

Insert Equation 4 here

More generally,

Insert Equation 5 here

where i=2,..., t.
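The equations themselves are not reproduced in this text. As a sketch only,
the stated assumptions (an examinee who can eliminate i distractors answers
at random, without repetition, among the t-i remaining alternatives) appear
to imply the following for t=4. Here $\zeta$ is the proportion of examinees
who know, $\zeta_i$ is the proportion who can eliminate i distractors, and
$p_i$ is the probability that the first correct response occurs on the ith
attempt. This display is a reconstruction from the assumptions, not the
author's typeset Equations 1-5.

```latex
% Reconstruction under the stated assumptions; not the author's typeset equations.
\begin{align*}
p_1 &= \zeta + \frac{\zeta_0}{4} + \frac{\zeta_1}{3} + \frac{\zeta_2}{2}, \\
p_2 &= \frac{\zeta_0}{4} + \frac{\zeta_1}{3} + \frac{\zeta_2}{2}, \\
p_3 &= \frac{\zeta_0}{4} + \frac{\zeta_1}{3}, \\
p_4 &= \frac{\zeta_0}{4}, \\
p_i &= \sum_{j=0}^{t-i} \frac{\zeta_j}{t-j}, \qquad i = 2, \dots, t .
\end{align*}
```

Under this reading $p_1 - p_2 = \zeta$ and the $p_i$ sum to one, which agrees
with the way the text later estimates the proportion who know from the first
two cell counts.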

For a random sample of N examinees, let $x_i$ be the number who correspond
to the event associated with $p_i$. For example, $x_1$ is the number of
examinees who are correct on the first alternative chosen, and $x_2$ is the
number of examinees who are incorrect and then correct. The $x_i$'s have
a multinomial probability function given by

Insert Equation 6 here

where

Insert Equation 7 here






Since $\zeta = p_1 - p_2$,

Insert Equation 8 here

is an unbiased estimate of $\zeta$. From Zehna (1966) it also follows that
$\hat{\zeta}$ is an unrestricted maximum likelihood estimator. Proceeding in
a similar manner also yields unbiased, unrestricted, maximum likelihood
estimates of the $\zeta_i$'s, namely,

Insert Equation 9 here

Insert Equation 10 here

Insert Equation 11 here

Note the model assumes that

Insert Equation 12 here

Maximum likelihood estimates of the $\zeta$'s are available under this
restriction on the $p_i$'s, as noted by Barlow et al. (1972). For example,
the maximum likelihood estimate of $\zeta$, assuming equation 12 holds, is
given by equation 8 when $x_1 \ge x_2$, and it is $\hat{\zeta}=0$ otherwise.
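As an illustration of how such estimates might be computed from attempt
counts, here is a minimal Python sketch. It assumes the model implies that
the proportion who can eliminate i distractors equals
(t-i)(p_{t-i} - p_{t-i+1}), with p_{t+1} taken to be 0; that identity is
derived from the stated assumptions rather than copied from Equations 9-11,
which are not shown in this text. The function names and the example counts
are hypothetical.

```python
def estimate_auc_proportions(x):
    """Unrestricted estimates from answer-until-correct attempt counts.

    x[j] is the number of examinees whose first correct response came on
    attempt j+1, so len(x) equals t, the number of alternatives.
    Returns (zeta_hat, elim_hat), where elim_hat[i] estimates the proportion
    of examinees who can eliminate i distractors (i = 0, ..., t-2).
    """
    n = sum(x)
    t = len(x)
    p = [count / n for count in x]            # usual estimates x_i / N
    zeta_hat = p[0] - p[1]                    # zeta = p_1 - p_2
    elim_hat = []
    for i in range(t - 1):
        j = t - i                             # 1-based attempt number
        p_j = p[j - 1]
        p_next = p[j] if j < t else 0.0       # p_{t+1} is taken to be 0
        elim_hat.append(j * (p_j - p_next))   # (t-i)(p_{t-i} - p_{t-i+1})
    return zeta_hat, elim_hat


def zeta_restricted(x):
    """Estimate of zeta under the order restriction: 0 when x_1 < x_2."""
    n = sum(x)
    return (x[0] - x[1]) / n if x[0] >= x[1] else 0.0


# Hypothetical counts for a t=4 item answered by 100 examinees.
zeta_hat, elim_hat = estimate_auc_proportions([60, 20, 12, 8])
print(zeta_hat, elim_hat)
```

Note that the unrestricted estimates can fall outside the interval from 0 to
1 when the observed counts are not in decreasing order, which is why the
restricted version is shown separately.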






Using the Model to Analyze Achievement Test Items

Macready and Dayton (1977) describe a probability model based on
Type I guessing that might be used to analyze mastery tests consisting
of equivalent items. This section illustrates how the above model can
be used to analyze achievement test items in a similar but different
fashion.

Suppose, as is customary, it is decided that an examinee knows the
correct response if the first alternative chosen is the correct answer,
and that otherwise the examinee does not know. In this case a test
constructor would like to know the accuracy of the decision about a
typical examinee based on his/her response.

The cells in Table 1 give the probability of the four possible
outcomes when an examinee responds to an item.

Insert Table 1 here

Thus for a randomly sampled examinee, the probability of a correct decision
about an examinee's latent state is the proportion of agreement in Table 1,
namely,

Insert Equation 13 here







An unrestricted maximum likelihood estimate of P is just

Insert Equation 14 here

where $\hat{\zeta}$, $\hat{\zeta}_0$, $\hat{\zeta}_1$ and $\hat{\zeta}_2$ are
given by equations 8-11. For any t,

Insert Equation 15 here.

P can also be estimated assuming equation 12 holds, as is illustrated
below. In many instances this will yield the same estimate of P as is
given by equation 14, but this is not always the case.

Using equation 13, it would seem that for any fixed $\zeta$, the accuracy
of an item is maximized when guessing is at random, i.e., when
$\zeta_1=\zeta_2=0$ and $\zeta_0=1-\zeta$. This can be established in a more
formal manner as follows: The inequality

Insert Equation 16 here

holds whenever $x_1 \le x_2 \le \dots \le x_n$ if and only if

Insert Equation 17 here

and $\sum a_i = \sum b_i$ (e.g., Marshall and Olkin, 1979, p. 445). It
follows that P is maximized when $\zeta_1=\zeta_2=0$, since equation 17 holds
when $a=(\zeta, \zeta_0, \zeta_1, \zeta_2)$ and $b=(\zeta, 1-\zeta, 0, 0)$.
* 

Another way to characterize Table 1 is to use the "del" measure
developed by Hildebrand et al. (1977) which, for the situation at hand,
is equivalent to Cohen's kappa (Cohen, 1960). In terms of the $\zeta$'s, this
measure of association is

Insert Equation 18 here

where

Insert Equation 19 here

Following Hildebrand et al., $\kappa$ can be interpreted as follows: Suppose
it is desired to measure the extent to which an examinee's latent state can
be "predicted" according to the decision rule being used. The off-diagonal
cells in Table 1 represent the error rates. The index $\kappa$ represents the
proportional reduction in the number of cases in the pair of error cells
when a shift is made from statistical independence with the population
marginals to the actual probability structure.

Note that Equation 18 is the value of $\kappa$ assuming the model holds.
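Equations 18 and 19 and Table 1 are not reproduced in this text, so the
following Python sketch should be read as an assumption-laden illustration
rather than the paper's formula. It applies the standard definition of
Cohen's kappa to a 2x2 table whose cells are inferred from the decision rule
("knows" if and only if the first choice is correct): the probability that
an examinee knows and is judged to know is taken to be zeta, the probability
that an examinee does not know but is judged to know is p1 - zeta, and an
examinee who knows is never judged not to know. The function name and the
example values are hypothetical.

```python
def kappa_first_choice_rule(zeta, p1):
    """Cohen's kappa for the 'first choice correct => knows' decision rule.

    zeta -- proportion of examinees who know the answer
    p1   -- probability of a correct response on the first attempt
    The cell layout below is inferred from the stated assumptions.
    """
    knows_and_judged_knows = zeta
    not_knows_and_judged_not = 1.0 - p1
    # Observed agreement: the two diagonal cells.
    observed = knows_and_judged_knows + not_knows_and_judged_not
    # Agreement expected under independence with the same marginals:
    # P(judged knows) = p1 and P(knows) = zeta.
    expected = p1 * zeta + (1.0 - p1) * (1.0 - zeta)
    return (observed - expected) / (1.0 - expected)


# Hypothetical values: 40% know, 60% are correct on the first attempt.
print(kappa_first_choice_rule(zeta=0.4, p1=0.6))
```

The function is undefined when the expected agreement equals one (for
example, when every examinee knows), a case the sketch does not handle.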

A Measure of Item "Idealness"

Weitzman (1970) describes an asymptotic test of whether an item is
ideal. As previously indicated, an item is defined to be ideal if guessing
is at random. In the above notation, this corresponds to having
$\zeta_1=\zeta_2=0$, which implies that $p_2=p_3=p_4$. A practical problem is
that the null hypothesis that $p_2=p_3=p_4$ might be tested and rejected,
when in fact $p_2$, $p_3$ and $p_4$ are nearly the same in value. This in
turn might lead to efforts in improving the distractors when the item is
already close to being ideal.








The simplest approach to this problem is to estimate $\zeta_1$ and $\zeta_2$
and see how close they are to zero. If they are not, simply examine the
distractors and decide whether any of them can be improved. Some additional
possibilities are described and illustrated below.

When trying to determine whether $\zeta_1$ and $\zeta_2$ are both close to
zero, it might be desirable to take into account their combined effect on how
close the item is to being ideal. Looking at $\zeta_1$ and $\zeta_2$
separately, they might appear to be close to zero, but together, perhaps the
item could be improved by a substantial amount. The problem becomes more
complex when more than three distractors are used. Thus, it would be
convenient to have some measure of how well an item approximates the ideal
situation where $\zeta_0=1-\zeta$.

One approach is to estimate $\zeta$, which yields an estimate of the
proportion of agreement in Table 1 for the case $\zeta_0=1-\zeta$. Thus, we
have estimated the maximum possible value of P for fixed $\zeta$, say
$P_{\max}$, which corresponds to the estimated value of $\zeta$. For t=4,
$P_{\max} = (3+\zeta)/4$. Next, estimate P, which yields an estimate of

Insert Equation 20 here

This gives a measure of how ideal the item really is. When the model
holds, $\Delta \ge 0$, and the closer $\Delta$ is to zero, the better the
item.

Employing the $\Delta$ measure seems to be intuitively appealing, and in
some situations it might suffice. However, there are at least two
objections to its use. First, it has been suggested (e.g., Marshall and
Olkin, 1979, p. 408) that measures of inequality should have certain
properties, namely, they should be Schur-convex, or strictly Schur-convex.
Here the goal is to measure the inequality of $p_2$, $p_3$, $p_4$. (The
meaning of a Schur-convex function is not given since it does not play a
direct role in the results to follow. The interested reader is referred to
Marshall and Olkin, Chapter 3.) This requirement was first formulated by
Dalton (1920), and steps in this direction were taken by Lorenz (1905) and
Pigou (1912). Thus, as a measure of the inequality of $p_2$, $p_3$ and
$p_4$, $\Delta$ might be objectionable because it is not Schur-convex. To
see this, it is sufficient to observe that $\Delta$, as a function of $p_2$,
$p_3$ and $p_4$, is not symmetric. The second objection is that even when
the model holds, the estimates of $\zeta_1$ and $\zeta_2$ can be negative,
and the estimate of $\zeta_0$ can be greater than one. In this case $\Delta$
cannot be interpreted as a difference of two probabilities. Perhaps we could
use $\Delta$ anyway, but an investigator might prefer to use a more
traditional index of inequality.

For the problem at hand, the index of inequality that suggests
itself is the entropy function. The entropy of a probability mass
function $p_k \ge 0$, k=1, ..., r, is

Insert Equation 21 here

where $\sum p_k = 1$. (In some instances, the logarithms in equation 21 are
taken to the base 10 or the base 2. See Kullback, 1959, p. 7.) The function H
provides a measure of the degree of uniformness of a distribution. That
is, the larger is H, the more uniform is the distribution. The minimum
value of H occurs when some $p_k=1$, its maximum value occurs when
$p_1=\dots=p_r=1/r$, and it is Schur-concave (implying that $-H$ is
Schur-convex). See Marshall and Olkin (1979, chapter 13, section E). To
measure the idealness of an item, the inequality of $p_2$, $p_3$ and $p_4$
needs to be measured, which suggests that $H(q_1, q_2, q_3)$ be used, where
$q_i = p_{i+1}/(1-p_1)$, i=1,2,3. In this case the maximum possible value of
H occurs when $q_i=(t-1)^{-1}$. An additional reason for using the entropy
function is given in the next section of the paper. Brown (1965, section 3)
also used the entropy function but in a slightly different fashion.

Empirical Checks on the Model

From equations 1-4 various restrictions on the $p_i$'s are evident in
order for the model to hold. For instance, it requires having
$p_1 \ge p_2 \ge p_3 \ge p_4$. This assumption can be tested using results
reported by Robertson (1978). It should be noted that when $p_1=p_2$, the
probability of having $x_1 > x_2$ approaches .5 as N, the number of
examinees, gets large. Thus, there is a reasonably high probability that the
usual estimate of the $p_i$'s will indicate that the model does not hold
when the $p_i$'s are approximately equal in value. Of course, the hypothesis
$p_2=p_3=p_4$ can be tested, but this does not give a direct measure of how
ideal an item is. The null hypothesis might be rejected, for example, but
this does not directly indicate the extent to which $p_2$, $p_3$ and $p_4$
are unequal. Another approach might be to estimate H, especially when the
data suggest the model might not hold, and if H is reasonably close to its
maximum value, decide that the item is ideal. We are not suggesting that
hypothesis testing be discarded altogether; the point is that the entropy
function gives us some additional information about how close an item is to
being ideal that is otherwise unavailable. It might help to note that a
similar situation occurs in the analysis of variance (Hays, 1973,
pp. 484-488).


Another requirement of the model is that $p_4 \le 1/4$; otherwise,
$\zeta_0 > 1$. For similar reasons the model requires that
$p_3 - p_4 \le 1/3$ and $p_2 - p_3 \le 1/2$. However,
$p_1 \ge p_2 \ge p_3 \ge p_4$ implies that these additional inequalities are
true.

* 

Illustrations

The results given above are illustrated with test scores for students
enrolled in an undergraduate psychology course at the University of Southern
California. Each item had t=5 alternatives. There were four test forms,
and each form had forty items. For simplicity, only 4 items are analyzed,
and only one test form is used. A more extensive analysis of the data,
together with some new theoretical results, will appear in a forthcoming
report.

Table 2 gives the observed frequencies of the number of examinees
who got the item correct on the ith attempt (i=1, ..., 5). For example,
there were 42 examinees who were incorrect on their first attempt, but
were correct on their second attempt of item 2.

Insert Table 2 here

The first step when applying the results given above is to test the
hypothesis that equation 12 holds. As previously mentioned, this is
accomplished with results in Robertson (1978). This was done for all 40 items
on the test using a .01 level of significance. For items 1 and 2 in
Table 2, applying Robertson's test is not necessary since the estimate of
the $p_i$'s already satisfies equation 12. Item 3 is highly nonsignificant,
but the null hypothesis is rejected for item 4.

For 21 of the 40 items, Robertson's test was unnecessary since the
estimate of the $p_i$'s satisfied equation 12. For the remaining items,
the null hypothesis was rejected only once; this was for item 4 in
Table 2.
Next suppose a test constructor wants to determine whether a conventional
scoring procedure will yield reasonably accurate decisions about
whether an examinee has acquired the skills represented by items 1, 2
and 3 in Table 2. An estimate of P via equation 15 yields a partial
solution to this problem. For items 1 and 2, the estimate of $\zeta$ is
(139-14)/168=.744 and (100-42)/168=.345, respectively. Thus, the
corresponding estimates of P are .917 and .75.
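The arithmetic in this paragraph can be reproduced with a short Python
sketch. One assumption is made: the estimate of P is computed as 1 - x2/N.
Equation 15 is not shown in this text, but this expression follows from the
decision rule (an error occurs only when an examinee who does not know is
correct on the first attempt) and it reproduces the values quoted above.
The function name is hypothetical.

```python
def zeta_and_accuracy(x1, x2, n):
    """Return (zeta_hat, p_hat) from the first two attempt counts.

    x1 -- examinees correct on the first attempt
    x2 -- examinees incorrect once, then correct
    n  -- total number of examinees
    """
    zeta_hat = (x1 - x2) / n   # estimated proportion who know
    p_hat = 1.0 - x2 / n       # assumed form of the estimate of P
    return zeta_hat, p_hat


# Counts quoted in the text for items 1 and 2 (N = 168).
print(zeta_and_accuracy(139, 14, 168))
print(zeta_and_accuracy(100, 42, 168))
```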

As for item 3, estimating $p_3$ and $p_4$ under the assumption that
equation 12 holds requires an application of the pool-adjacent-violators
algorithm in Barlow et al. (1972, pp. 13-18). The result is
$\hat{p}_3=\hat{p}_4=(29+16)/(2(168))=.134$. The estimate of $\zeta$ is .202,
and so the estimate of P is .797. Note that using the pool-adjacent-violators
algorithm yields the same estimate of P as is obtained when equation 15 is
used and when $p_i$ is estimated with $x_i/N$. However, when $x_1 < x_2$,
using $\hat{p}_i=x_i/N$ will yield different results. The reason is that the
maximum likelihood estimate of $\zeta$, assuming equation 12, is 0 when
$x_1 < x_2$ and it is $(x_1-x_2)/N$ otherwise. Consider, for example, item 4
in Table 2. Here $\hat{\zeta}=0$, and the maximum likelihood estimate of
$p_2$, assuming equation 12, is .369. Thus, the estimate of P is .63. If,
however, we use $\hat{p}_i=x_i/N$, the estimate of P is .446.
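A minimal Python sketch of a pool-adjacent-violators fit for a
non-increasing sequence is given below. It is a generic equal-weight
implementation, not code from Barlow et al. The counts in the example are
hypothetical: Table 2 is not reproduced in this text, so they were chosen
only to be consistent with the figures quoted for item 3 (N=168, pooled
value .134, estimate of zeta .202).

```python
def pava_nonincreasing(values):
    """Least-squares fit of a non-increasing sequence (equal weights).

    Adjacent values that violate the ordering are pooled (averaged) until
    the fitted sequence is non-increasing.
    """
    blocks = []  # each block is [mean, number of pooled values]
    for value in values:
        blocks.append([float(value), 1])
        # Pool while the previous block's mean is smaller than the last one.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            mean2, size2 = blocks.pop()
            mean1, size1 = blocks.pop()
            size = size1 + size2
            blocks.append([(mean1 * size1 + mean2 * size2) / size, size])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted


# Hypothetical attempt counts for a t=5 item (not taken from Table 2).
counts = [68, 34, 16, 29, 21]
n = sum(counts)
p_restricted = pava_nonincreasing([c / n for c in counts])
zeta_hat = p_restricted[0] - p_restricted[1]
p_hat = 1.0 - p_restricted[1]
print(p_restricted, zeta_hat, p_hat)
```

With these counts only the third and fourth cells are pooled, so the first
two cell estimates, and hence the estimates of zeta and P, are unchanged by
the restriction.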



Suppose the first three items in Table 2 constituted the whole test.
Another important point is that the estimates of P yield an estimate of
$\gamma$, the expected number of correct decisions for the n items on the
test. The estimate is simply the sum of the estimated P values. For the case
at hand $\gamma$ is estimated to be 2.46. Thus, when a conventional scoring
procedure is used to determine whether an examinee knows the correct
response to an item, the expected number of correct decisions for the
first three items is estimated to be 2.46.

If any of the P values is small, one possible way to improve the
item is to improve the distractors. For example, efforts might be made
to improve the least frequently chosen distractor.

To measure the effectiveness of the distractors, the entropy function
is applied. For item 1 in Table 2, $q_1$=.483, $q_2$=.31, $q_3$=.138 and
$q_4$=.069. Substituting these values into equation 21 yields H=1.172. The
maximum possible value of H occurs when $q_i$=.25 (i=1,2,3,4), in which case
H=1.386. For item 2, H=.99 and for item 3, H=1.347. Thus, the test scores
indicate that the item with the most effective distractors is item 3,
followed by item 1. The distractors for item 2 are the least effective,
having achieved 71.4% of the maximum possible entropy.
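The entropy values quoted above can be checked with a few lines of Python.
The sketch assumes natural logarithms, which is consistent with the stated
maximum of 1.386 for four equally likely distractors; the function name is
hypothetical.

```python
import math


def entropy(q):
    """Entropy -sum(q_k * ln q_k) of a probability mass function.

    Cells with zero probability contribute nothing to the sum.
    """
    return -sum(v * math.log(v) for v in q if v > 0)


# Conditional distractor proportions quoted in the text for item 1.
h_item1 = entropy([0.483, 0.31, 0.138, 0.069])
# Maximum entropy for four distractors: all proportions equal to .25.
h_max = entropy([0.25, 0.25, 0.25, 0.25])
print(round(h_item1, 3), round(h_max, 3), round(h_item1 / h_max, 3))
```

Dividing an item's entropy by the maximum gives the percentage figure used
in the text to compare distractor effectiveness across items.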

It should be pointed out that the above estimate of H for item 3
was not made under the assumption that equation 12 holds. If equation
12 is assumed, and the pool-adjacent-violators algorithm is applied,
this yields $p_1$=.405, $p_2=p_3=p_4$=.1568 and $p_5$=.125, in which case
H=1.382. In either case, item 3 has the most effective distractors.

A MODEL FOR TYPE II GUESSING

In many instances a test consists of items representing skills that
are thought to be most important. Moreover, there are situations where
the skills on a test are the only ones that are of interest to the test
constructor. However, in other situations (see, e.g., Hambleton et al.,
1978) the items on a test are intended to be a representative sample of
some larger item domain. The goal is to use test results to make
inferences about what an examinee knows relative to the item pool. In either
case, the results in the previous section are of interest. This section
considers how an AUC test might be used to solve certain measurement
problems when generalizing results for a single examinee to an item domain.

For a specific examinee, let ζ be the proportion of skills among a
domain of skills that he/she has acquired. Further suppose that each skill
is represented by a multiple-choice test item having t alternatives from
which to choose. Again for convenience, emphasis is given to the special
case t=4. Let ζ_i (i=0, ..., t-2) be the proportion of items for which
the examinee does not know the correct response and can eliminate i distractors.
Once i distractors are eliminated, the examinee is assumed to guess at random from
among those that remain. Let r_i be the probability of a correct response on the
ith attempt. Then for t=4,

Insert Equation 22 here 



Insert Equation 23 here 


Insert Equation 24 here 



Insert Equation 25 here 







Suppose that for a random sample of n items, y_i is the number of items
on which the examinee is correct on the ith alternative chosen. Unbiased
estimates of the ζ_i's can be derived just as unbiased estimates of the ζ_i's
were derived in the previous model. In particular, an unbiased estimate of ζ is

Insert Equation 26 here



Equation 26 is an estimate of true score that is corrected for an
examinee's partial information. Note that equation 26 contains the usual
correction for guessing formula score as a special case.
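For t=4 the estimates follow from differencing equations 22 to 25 and replacing each r_i with y_i/n, in parallel with equations 8 to 11. The sketch below is illustrative: the function name and the response counts are hypothetical, not data from the study.

```python
def partial_information_estimates(y, n):
    """Estimates implied by equations 22-25 for t = 4.

    y[i] is the number of items answered correctly on attempt i + 1.
    """
    y1, y2, y3, y4 = y
    return {
        "zeta": (y1 - y2) / n,         # equation 26: r_1 - r_2
        "zeta_2": 2 * (y2 - y3) / n,   # can eliminate two distractors
        "zeta_1": 3 * (y3 - y4) / n,   # can eliminate one distractor
        "zeta_0": 4 * y4 / n,          # can eliminate no distractors
    }


# Hypothetical examinee: 20 items, correct on attempts 1-4 for 12, 4, 2, 2 items.
est = partial_information_estimates([12, 4, 2, 2], n=20)
print(est, sum(est.values()))
```

The four estimates always sum to one because every item is answered correctly on some attempt, although individual estimates can be negative when the counts are not decreasing.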

The Optimal Linear Estimator of ζ

Let z be a random variable that is an unbiased estimate of the unknown
parameter θ. Under squared error loss, Griffin and Krutchkoff (1971) show
that the optimal linear estimator of θ is

Insert Equation 27 here

where α = Var(θ)/Var(z) and δ = (1-α)E(θ). In mental test theory, equation
27 is known as Kelley's linear regression estimate of true score (Kelley,
1947, p. 409). The point made by Griffin and Krutchkoff is that if an
unbiased estimate of an examinee's true score is used, equation 27 is
optimal regardless of the shape of the true score distribution. Wilcox (1978)
compares equation 27 to several other estimators assuming the binomial
error model holds but where observed scores are generated according to a
two-term approximation to the compound binomial error model. The results
suggest that when simultaneously estimating the true score of several
examinees, the Griffin-Krutchkoff estimator should be used when an ensemble
squared error loss function is being used. Furthermore, the results
suggest that Kelley's linear regression estimate of ζ be employed.
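Substituting the definition of δ into equation 27 makes the shrinkage toward the population mean explicit. The lines below are only a rearrangement of the quantities defined above; the tilde notation for the estimator is an assumption.

```latex
\tilde{\theta} = \alpha z + \delta
              = \alpha z + (1-\alpha)\,E(\theta)
              = E(\theta) + \alpha\,[\,z - E(\theta)\,],
\qquad
\alpha = \frac{\mathrm{Var}(\theta)}{\mathrm{Var}(z)} .
```

Because z is unbiased for θ, Var(z) is at least Var(θ), so α lies between 0 and 1 and the estimate always falls between the observed value z and the mean E(θ).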

It is assumed that the y_i's have a multinomial distribution and that
observed test scores for N examinees are available. Estimates of E(ζ),
Var(ζ) and Var((y_1-y_2)/n) are needed to apply the results in Griffin and Krutchkoff,
where the expectations defining these quantities are over the population
of examinees.

Let 

Insert Equation 28 here 



Insert Equation 29 here 



where i=1, 2. Then



Insert Equation 30 here



Insert Equation 31 here 



Since cov(y_1, y_2 | p_1, p_2) = -n p_1 p_2, it follows that



Insert Equation 32 here

Thus,

Insert Equation 33 here



and 

Insert Equation 34 here

Letting v_1j and v_2j be the values of v_1 and v_2, respectively, for the jth
randomly sampled examinee, the above results suggest that E(ζ) be
estimated with

Insert Equation 35 here 

and E(ζ²) with

Insert Equation 36 here

Thus, an estimate of Var(ζ) is

Insert Equation 37 here

The variance of the marginal distribution of observed scores (y_1-y_2)/n
can be estimated in the usual manner, and so an estimate of the optimal
linear estimator of ζ is obtained by substituting the results in equation
27. Of course, the results just given contain, as a special case, the
optimal linear estimator under the assumption guessing is at random.
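The estimation steps can be sketched in a few lines. Everything here is a hedged illustration: the function name and the data are hypothetical, the moment formulas follow from the multinomial identities E(y_i(y_i-1)) = n(n-1)p_i² and E(y_1 y_2) = n(n-1)p_1 p_2, and the "usual manner" for the observed-score variance is assumed to mean the sample variance with an N-1 divisor.

```python
from statistics import variance


def optimal_linear_coefficients(y1, y2, n):
    """Estimate alpha and delta of equation 27 from N examinees' (y1, y2)."""
    N = len(y1)
    # Estimate of E(zeta): mean of (y1 - y2) / n over examinees.
    mean_zeta = sum((a - b) / n for a, b in zip(y1, y2)) / N
    # Estimate of E(zeta^2) = E(p1^2 + p2^2 - 2 p1 p2) from factorial moments.
    mean_zeta_sq = sum(
        (a * (a - 1) + b * (b - 1) - 2 * a * b) / (n * (n - 1))
        for a, b in zip(y1, y2)
    ) / N
    # A negative variance estimate is replaced by zero.
    var_zeta = max(mean_zeta_sq - mean_zeta ** 2, 0.0)
    # Sample variance of the observed scores (assumed N - 1 divisor).
    var_obs = variance([(a - b) / n for a, b in zip(y1, y2)])
    alpha = var_zeta / var_obs
    delta = (1 - alpha) * mean_zeta
    return alpha, delta


# Hypothetical scores for five examinees on a 10-item test.
y1 = [8, 6, 9, 5, 7]
y2 = [1, 2, 0, 3, 1]
n = 10
alpha, delta = optimal_linear_coefficients(y1, y2, n)
print(round(alpha, 4), round(delta, 4))
```

The function would fail with a division by zero if every examinee had the same observed score; a production version would need to guard against that case.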



Numerical Illustration

As a simple illustration, suppose we have five examinees with observed
y values as shown in the first two rows of Table 3, where the test length
is n=10.



Insert Table 3 here 



Then the estimate of E(ζ) is .42 and, with the estimate of E(ζ²) from
equation 36, the estimate of Var(ζ) is .0236. The estimate of Var((y_1-y_2)/n)
is .0687. Therefore, the estimate of α is .3435, and so the estimate
of the optimal linear estimator is



Insert Equation 38 here



The estimates of ζ for the five examinees are given in the last row of Table 3.
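Equation 38 is simple to apply once the two coefficients are in hand. In the sketch below the coefficients .3435 and .276 are those reported in the text, while the (y_1, y_2) pairs are hypothetical stand-ins, since the Table 3 values are not reproduced here.

```python
ALPHA_HAT = 0.3435   # estimate of alpha reported in the text
DELTA_HAT = 0.276    # constant term of equation 38


def equation_38(y1, y2, n=10):
    """Estimated optimal linear estimate of zeta for one examinee."""
    return ALPHA_HAT * (y1 - y2) / n + DELTA_HAT


# Hypothetical (y1, y2) pairs for a 10-item test.
for y1, y2 in [(9, 0), (6, 2), (3, 4)]:
    raw = (y1 - y2) / 10
    print(y1, y2, raw, round(equation_38(y1, y2), 3))
```

Each raw estimate is pulled toward DELTA_HAT / (1 - ALPHA_HAT), which is about .42, so a negative raw value such as -.1 is moved well inside the admissible range.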

Before continuing, some additional comments about the above results
are in order. First, the estimate of Var(ζ) can be negative, in which case
zero is used. The same phenomenon occurs in the case considered by Griffin
and Krutchkoff. Second, the optimal linear estimator of ζ derived above
does not assume that the model holds. It is the optimal linear estimator
of p_1-p_2, but no insistence is made that p_1 ≥ p_2. If the model holds,
implying that p_1 ≥ p_2, equation 33 is no longer true, and so the condition
of having an unbiased estimate of ζ, as is assumed by Griffin and Krutchkoff,
is no longer satisfied. For further comments on this approach to estimation
see Griffin and Krutchkoff (1971) and Wilcox (1978).

A Strong True Score Model

This section assumes that for any examinee, y_1 and y_2 have a
multinomial probability function given by

Insert Equation 39 here

where, as before, ζ=p_1-p_2 and 0≤ζ≤1 is assumed. Equation 39 can be
justified under an item sampling model, or it might give a good approximation
to the joint probability function of y_1 and y_2. It should be noted that
equation 39 implies that y_1 has a binomial probability function, and
so when every examinee takes the same n items, the items have the same
level of difficulty (Lord and Novick, 1968, chapter 23). On theoretical
grounds, this implication of equation 39 is unjustifiable. However, for
certain measurement problems, it appears that this might not be a serious
restriction (Wilcox, 1977, 1978; Algina and Noe, 1978). See also
Subkoviak (1978).

Strong true-score models attempt to extend assumptions such as
equation 39 to a population of examinees. The basic problem here is to find
a family of distributions that approximates g(ζ, p_2), the joint density of
ζ and p_2. Once this is done, various measurement problems can be solved
(e.g., Lord, 1965; Huynh, 1976; Wilcox, 1977).




Past experience with this type of problem (Keats and Lord, 1962;
Lord, 1965; Wilcox, 1979) suggests approximating g(ζ, p_2) with a bivariate
Dirichlet function given by

Insert Equation 40 here

where Γ is the usual gamma function, v_i>0 (i=1,2,3) are unknown parameters
and 0≤ζ+p_2≤1. (Marshall and Olkin, 1979, pp. 306-307, describe two other
distributions to which the name "Dirichlet" is attached. Here, only
equation 40 is considered.)

To estimate the v_i, proceed as follows: first, observe that the
marginal distribution of ζ is beta with parameters v_1 and v_2+v_3 (e.g.,
Wilks, 1962). It follows that

Insert Equation 41 here

where, as before, μ_ζ is the mean of ζ over the population of examinees.
For similar reasons,

Insert Equation 42 here

where μ_p2 is the mean of p_2. It is also known (e.g., Wilcox, 1977) that

Insert Equation 43 here

where

Insert Equation 44 here

Insert Equation 45 here

As previously indicated, μ_ζ and the variance of ζ can be estimated, which yields an
estimate of s. An estimate of μ_p2 is N^-1 Σ v_2j, and so equation 45 yields
an estimate of the v_i's.
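The moment equations can be solved in closed form. The sketch below assumes the reading of equations 41 to 45 in which s = v_1+v_2+v_3 = μ_ζ(1-μ_ζ)/Var(ζ) - 1, so that v_1 = sμ_ζ and v_2 = sμ_p2; the function name and the input values are hypothetical.

```python
def dirichlet_moment_estimates(mu_zeta, var_zeta, mu_p2):
    """Method-of-moments estimates of (v1, v2, v3)."""
    # Beta(v1, v2 + v3) marginal for zeta: variance = mu(1 - mu) / (s + 1).
    s = mu_zeta * (1 - mu_zeta) / var_zeta - 1
    v1 = s * mu_zeta      # from v1 (1 - 1/mu_zeta) + v2 + v3 = 0
    v2 = s * mu_p2        # from v1 + v2 (1 - 1/mu_p2) + v3 = 0
    v3 = s - v1 - v2      # the three parameters sum to s
    return v1, v2, v3


# Hypothetical moment estimates, used only to exercise the formulas.
mu_zeta, var_zeta, mu_p2 = 0.42, 0.0236, 0.20
v1, v2, v3 = dirichlet_moment_estimates(mu_zeta, var_zeta, mu_p2)
print(round(v1, 3), round(v2, 3), round(v3, 3))
```

If the estimated variance is zero or so large that s is not positive, the estimates are unusable, which is one practical check on the model.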

Mosimann (1962) applies the Dirichlet-multinomial model to two real
data sets, he discusses how to check the implications of the model, and
he gives several other results that have practical value, and so these
issues are not discussed further. Since the Dirichlet-multinomial model
is the multivariate analog of the beta-binomial model, additional insights
into the appropriateness of the model are available from Wilcox (1981).
The point is that the Dirichlet-multinomial model can be applied to AUC
scoring procedures and so solve various measurement problems as previously
indicated. An advantage of the model is that it allows guessing to vary
over the population of examinees.

An important point is that if the model is assumed to hold, and in
particular 0≤ζ≤1, this suggests estimating ζ to be zero whenever the
unbiased estimate is negative. In this case the estimates of E(ζ) and E(ζ²)
are not justified for the reasons given above, but they are still appropriate
for the reasons given by Wilcox (1979).




One point that deserves special mention is that a new formula score
can be derived that corrects for partial information. The derivation is
essentially the same as the derivation of equation 4 in Wilcox (1979).
Thus, we merely note that

Insert Equation 46 here

where B is the usual beta function. Thus, once the v_i's are estimated,
we only need y_1 to estimate ζ.
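A numerical sketch of this formula score is given below. It is a reconstruction under stated assumptions rather than the author's code: y_1 is taken to be binomial with success probability ζ+p_2, (ζ, p_2) is taken to follow the Dirichlet density of equation 40, p(y_1) is the resulting beta-binomial probability, and the three-argument normalizing constant is Γ(v_1)Γ(v_2)Γ(v_3)/Γ(v_1+v_2+v_3). All function names and parameter values are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the usual beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_prob_y1(y1, n, v1, v2, v3):
    """Beta-binomial probability of y1 when zeta + p2 ~ Beta(v1 + v2, v3)."""
    return comb(n, y1) * exp(
        log_beta(y1 + v1 + v2, n - y1 + v3) - log_beta(v1 + v2, v3)
    )


def expected_zeta_given_y1(y1, n, v1, v2, v3):
    """Posterior mean of zeta given y1, summing over the binomial expansion."""
    log_b3 = lgamma(v1) + lgamma(v2) + lgamma(v3) - lgamma(v1 + v2 + v3)
    total = 0.0
    for w in range(y1 + 1):
        total += comb(y1, w) * exp(
            log_beta(w + v2, n - y1 + v3)
            + log_beta(y1 - w + v1 + 1, n - y1 + w + v2 + v3)
            - log_b3
        )
    return comb(n, y1) * total / marginal_prob_y1(y1, n, v1, v2, v3)


# Hypothetical parameter values for a 10-item test.
n, v1, v2, v3 = 10, 3.9, 1.9, 3.5
for y1 in (0, 5, 10):
    print(y1, round(expected_zeta_given_y1(y1, n, v1, v2, v3), 3))
```

Working on the log scale with lgamma keeps the beta functions stable for longer tests, where the raw gamma values would overflow.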



DISCUSSION 

One objection to the assumptions that were made is that the
resulting model is too simple. For instance, it does not allow for the
possibility of knowing and being incorrect, or the possibility of having
misinformation. Brown and Burton (1979) describe a real situation where
the latter problem occurs, Frary (1980) gives an interesting account of
how misinformation can affect various scoring procedures, and Wilcox
(1980) indicates the seriousness of the former problem when determining
the length of a criterion-referenced test. Although the present model
does not correct these problems, empirical checks on the appropriateness
of the model can be made. It should be mentioned that models have been
proposed for handling the two errors just described (e.g., Duncan, 1974;
Macready and Dayton, 1977; Dayton and Macready, 1976). However, these
models require additional assumptions that might not be met. The
Macready-Dayton model, for example, assumes that equivalent items are available
for measuring a particular skill. The assumption of equivalent items can
be checked using a goodness of fit test (Macready and Dayton, 1977) or
a procedure described by Hartke (1978), and results reported by Baker
and Hubert (1977) might also be useful in this endeavor. (See, also,
Wilcox, in press, a.) Here it is assumed that empirical investigations
fail to support the existence of equivalent items, or that it is decided
a priori that equivalent items do not exist. Finally, the Duncan model
corrects for misinformation, but it assumes guessing is at random. The
goal here is to avoid this restriction, or to find ways in which it can
be empirically checked.

Another possible objection to the model is that it characterizes
examinees as belonging to one of two mutually exclusive classes, namely,
"knowing" and "not knowing." The relative merits of this approach are
discussed in a more general context by Reulecke (1977), Hilke et al.
(1977), Scandura (1971, 1973), and Spada (1976).

In some situations, the scoring procedure for Type II guessing might
be objectionable because it penalizes an examinee for having partial
information. That is, if an examinee wants to maximize his/her score (the
estimate of ζ) the strategy would be to minimize y_2. This could be done
by choosing an answer, and if it is wrong, deliberately choosing another
response that is believed to be incorrect. In this case the examinee is
not behaving in the manner assumed, and so the model is inappropriate.
One approach to this problem is to have an examinee always mark his/her
first and second choice without revealing which response is correct.
Letting y_1 be the number of times the examinee's first choice is correct,
and letting y_2 be the number of times the second choice is correct, ζ is again
estimated with (y_1-y_2)/n. Indeed, all of the previous results still
hold. However, this might not eliminate the problem under discussion.
Suppose, for example, that an examinee can eliminate all but two of the
alternatives from consideration for every item on the test. If an
examinee's two choices correspond to these two alternatives, the expected
estimate of ζ is 0. However, if an examinee's first choice is between
the two alternatives that contain the correct response, and if the
examinee is deliberately incorrect on the second choice, the expected
estimate of ζ is .5. One way to minimize this problem is to subject the
items to an analysis that attempts to ensure guessing is at random. It
was already indicated how this might be done. Another solution is to
apply the Dirichlet-multinomial model. If estimates of the v_i's can be
made available, the information on the examinee's first choice, the value
of y_1, is all that is needed in order to estimate ζ. Several other strong
true-score models are currently being investigated that might be useful
when addressing this problem. Another possibility is to check the
assumptions of the model; if they do not hold, simply score the test using
traditional techniques.
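The two expected values in the example above follow from one line of arithmetic each. The sketch below is illustrative, with a hypothetical function name; it assumes the examinee guesses at random between the two remaining alternatives on the first choice.

```python
from fractions import Fraction


def expected_zeta_estimate(p_first_correct, p_second_correct):
    """Per-item expectation of (y_1 - y_2) / n."""
    return p_first_correct - p_second_correct


half = Fraction(1, 2)

# Both choices fall on the two alternatives that cannot be eliminated:
# the second choice is correct whenever the first one is wrong.
honest = expected_zeta_estimate(half, half)

# Second choice deliberately placed on an alternative known to be wrong.
gaming = expected_zeta_estimate(half, Fraction(0))

print(honest, gaming)
```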

For practical purposes, perhaps the problem just described will be
inconsequential; this remains to be seen. Also note that this problem
is irrelevant in terms of the results given under Type I guessing.

In practice, the scoring rule proposed by Brown (1965) results in
scoring t-i points when the correct response is chosen on the ith attempt
of an item, where, as before, t is the number of alternatives from which
to choose (e.g., Frary, 1980). Thus, the sooner an examinee identifies
the right answer, the higher will be his/her score. In some cases, however,
this scoring procedure is also inadequate. First, it gives credit to an
examinee when a test constructor unintentionally produces ineffective
distractors. Second, and perhaps most importantly, it gives a measure of
partial information, but it does not tell us what an examinee knows in
the sense of estimating ζ. The same is true of the other scoring procedures
cited by Frary (1980), the scoring rule proposed by Coombs et al. (1956),
as well as the subset selection rule proposed by Gibbons, Olkin and Sobel
(1977, 1979). No claim is made that these procedures be abandoned, but as
argued by Morrison and Brockway (1979), estimating ζ can be important.
Another point is that only two responses to each item are needed in
order to estimate ζ for each examinee. The additional responses are
needed only for checking the appropriateness of the model, and in
particular, justifying (y_1-y_2)/n as an estimate of ζ. In some cases n will
be too small to accurately test the model. Determining whether this is
the case can be accomplished with the statistical techniques described
under Type I guessing.
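The contrast between the t-i scoring rule and the estimate of ζ can be seen on a small example. The sketch below is illustrative only; the function name and the attempt record are hypothetical.

```python
def brown_score(attempts, t):
    """Total of t - i points, where i is the attempt on which each item
    was answered correctly."""
    return sum(t - i for i in attempts)


# Hypothetical record for six 4-alternative items: the attempt number on
# which the correct response was chosen.
attempts = [1, 1, 2, 4, 1, 3]
t = 4

score = brown_score(attempts, t)
y1 = attempts.count(1)               # correct on the first attempt
y2 = attempts.count(2)               # correct on the second attempt
zeta_hat = (y1 - y2) / len(attempts)
print(score, round(zeta_hat, 3))
```

The point score uses every attempt, whereas the estimate of ζ needs only the first two responses to each item.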

Finally, it was suggested that the Dirichlet-multinomial
distribution be considered when trying to find a strong true-score model that
fits the data. It should be stressed, however, that considerably more
experience with this distribution is needed before it is routinely applied.
Wilcox (in press, b) got good results with the distribution using real
data, but the extent to which it gives a good fit to mental test data is
not known. An empirical investigation is currently underway in an attempt
to partially resolve this problem. Consideration will also be given to
several other strong true-score models. The results should be available
in the near future.

REFERENCES




Algina, J., & Noe, M. J. A study of the accuracy of Subkoviak's
single-administration estimate of the coefficient of agreement using two
true-score estimates. Journal of Educational Measurement, 1978,
15, 101-110.

Baker, F. B., & Hubert, L. J. Inference procedures for ordering theory.
Journal of Educational Statistics, 1977, 2, 217-233.

Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D.
Statistical inference under order restrictions. New York: Wiley,
1972.

Bliss, L. B. A test of Lord's assumption regarding examinee guessing
behavior on multiple-choice tests using elementary school students.
Journal of Educational Measurement, 1980, 17, 147-153.

Brown, J. Multiple response evaluation of discrimination. The British
Journal of Mathematical and Statistical Psychology, 1965, 18,
125-137.

Brown, J. S., & Burton, R. R. Diagnostic models in basic mathematical
skills. In the National Institute of Education, Testing, Teaching,
and Learning: Report of a Conference on Research on Testing.
Washington, D.C.: U.S. Department of Health, Education, and Welfare,
1979.

Cohen, J. A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 1960, 20, 37-46.

Coombs, C. H., Milholland, J. E., & Womer, F. B. The assessment of partial
knowledge. Educational and Psychological Measurement, 1956, 16, 16-37.

Cross, L. H., & Frary, R. B. An empirical test of Lord's theoretical
results regarding formula scoring of multiple-choice tests. Journal
of Educational Measurement, 1977, 14, 313-321.



Dalton, H. The measurement of the inequality of incomes. Econom. J.,
1920, 30, 348-361.

Dayton, C. M., & Macready, G. B. A probabilistic model for validation
of behavioral hierarchies. Psychometrika, 1976, 41, 189-204.

Duncan, G. T. An empirical Bayes approach to scoring multiple-choice
tests in the misinformation model. Journal of the American
Statistical Association, 1974, 69, 50-57.

Fhaner, S. Item sampling and decision making in achievement testing.
British Journal of Mathematical and Statistical Psychology, 1974,
27, 172-175.

Frary, R. B. The effect of misinformation, partial information, and
guessing on expected multiple-choice test item scores. Applied
Psychological Measurement, 1980, 4, 79-90.

Gibbons, J. D., Olkin, I., & Sobel, M. Selecting and ordering populations:
A new statistical methodology. New York: Wiley, 1977.

Gibbons, J. D., Olkin, I., & Sobel, M. A subset selection technique
for scoring items on a multiple choice test. Psychometrika, 1979,
44, 259-270.

Gilman, D. A., & Ferry, P. Increasing test reliability through self-scoring
procedures. Journal of Educational Measurement, 1972, 9, 205-207.

Griffin, B. S., & Krutchkoff, R. G. Optimal linear estimators: An
empirical Bayes version with application to the binomial distribution.
Biometrika, 1971, 58, 195-201.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D.
Criterion-referenced testing and measurement: A review of technical issues
and developments. Review of Educational Research, 1978, 48, 1-47.



Hanna, G. S. Incremental reliability and validity of multiple-choice
tests with an answer-until-correct procedure. Journal of Educational
Measurement, 1975, 12, 175-178.

Harris, C. W., & Pearlman, A. P. An index for a domain of completion or
short answer items. Journal of Educational Statistics, 1978, 3,
285-304.

Hartke, A. R. The use of latent partition analysis to identify homogeneity
of an item population. Journal of Educational Measurement, 1978, 15,
43-47.

Hays, W. Statistics for the Social Sciences. New York: Holt, Rinehart
and Winston, 1973.

Hildebrand, D. K., Laing, J. D., & Rosenthal, H. Prediction analysis of
cross classifications. New York: Wiley, 1977.

Hilke, R., Kempf, W. F., & Scandura, J. M. Deterministic and probabilistic
theorizing in structural learning. In H. Spada and F. Kempf (Eds.)
Structural Models of Thinking and Learning. Bern: Hans Huber, 1977.

Horst, P. The difficulty of a multiple choice test item. Journal of
Educational Psychology, 1933, 24, 229-232.

Huynh, H. Statistical consideration of mastery scores. Psychometrika,
1976, 41, 65-78.

Keats, J. A., & Lord, F. M. A theoretical distribution for mental test
scores. Psychometrika, 1962, 27, 59-72.

Kelley, T. L. Fundamentals of statistics. Cambridge: Harvard University
Press, 1947.

Kullback, S. Information theory and statistics. New York: Wiley, 1959.



Lord, F. M. A strong true-score theory, with applications. Psychometrika,
1965, 30, 239-270.

Lorenz, M. O. Methods of measuring concentration of wealth. Journal of
the American Statistical Association, 1905, 9, 209-219.

Macready, G. B., & Dayton, C. M. The use of probabilistic models in the
assessment of mastery. Journal of Educational Statistics, 1977, 2,
99-120.

Marshall, A. W., & Olkin, I. Inequalities: Theory of majorization and
its applications. New York: Academic Press, 1979.

Morrison, D. G., & Brockway, G. A modified beta-binomial model with
applications to multiple choice and taste tests. Psychometrika,
1979, 44, 427-442.

Mosimann, J. E. On the compound multinomial distribution, the multivariate
β-distribution, and correlations among proportions. Biometrika, 1962,
49, 65-82.

Pigou, A. C. Wealth and welfare. New York: Macmillan, 1912.

Pressey, S. L. Development and appraisal of devices providing immediate
automatic scoring of objective tests and concomitant self-instruction.
The Journal of Psychology, 1950, 29, 419-447.

Rao, C. R. Linear statistical inference and its applications. New York:
Wiley, 1973.

Reulecke, W. A. A statistical analysis of deterministic theories. In
H. Spada and F. Kempf (Eds.) Structural Models of Thinking and
Learning. Bern: Hans Huber, 1977.

Robertson, T. Testing for and against an order restriction on multinomial
parameters. Journal of the American Statistical Association, 1978,
73, 197-202.



Scandura, J. M. Deterministic theorizing in structural learning. Journal
of Structural Learning, 1971, 3, 21-53.

Scandura, J. M. Structural learning: Theory and research. New York:
Gordon and Breach, 1973.

Spada, H. Logistic models of learning and thought. In H. Spada &
F. Kempf (Eds.) Structural Models of Thinking and Learning. Bern:
Hans Huber, 1977.

Subkoviak, M. Empirical investigation of procedures for estimating
reliability for mastery tests. Journal of Educational Measurement,
1978, 15, 111-116.

Weitzman, R. A. Ideal multiple-choice items. Journal of the American
Statistical Association, 1970, 65, 71-89.

Wilcox, R. R. New methods for studying stability. In C. W. Harris,
A. Pearlman, & R. Wilcox, Achievement Test Items - Methods of Study.
CSE Monograph No. 6, Los Angeles: Center for the Study of Evaluation,
University of California, 1977. (a)

Wilcox, R. R. Estimating the likelihood of false-positive and
false-negative decisions in mastery testing: An empirical Bayes approach.
Journal of Educational Statistics, 1977, 2, 289-307. (b)

Wilcox, R. R. Estimating true score in the compound binomial error model.
Psychometrika, 1978, 43, 245-258.

Wilcox, R. R. Achievement tests and latent structure models. British
Journal of Mathematical and Statistical Psychology, 1979, 32,
61-71.

Wilcox, R. R. Determining the length of a criterion-referenced test.
Applied Psychological Measurement, 1980, 4,



Wilcox, R. R. A review of the beta-binomial model and its extensions.
Journal of Educational Statistics, 1981, to appear.

Wilcox, R. R. Analyzing the distractors of multiple-choice test items
or partitioning multinomial cell probabilities with respect to a
standard. Educational and Psychological Measurement, in press. (a)

Wilcox, R. R. The single administration estimate of the proportion of
agreement of a proficiency test scored with a latent structure model.
Educational and Psychological Measurement, in press. (b)

Wilks, S. S. Mathematical statistics. New York: Wiley, 1962.

Zehna, P. W. Invariance of maximum likelihood estimation. Annals of
Mathematical Statistics, 1966, 37, 744.



EQUATIONS

    p_1 = \zeta + \zeta_0/4 + \zeta_1/3 + \zeta_2/2                          [1]

    p_2 = \zeta_0/4 + \zeta_1/3 + \zeta_2/2                                  [2]

    p_3 = \zeta_0/4 + \zeta_1/3                                              [3]

    p_4 = \zeta_0/4                                                          [4]

    p_i = \sum_{j=0}^{t-i} \zeta_j/(t-j),   i = 2, \ldots, t                 [5]

    J \, p_1^{x_1} p_2^{x_2} p_3^{x_3} p_4^{x_4}                             [6]

    J = N!/(x_1! x_2! x_3! x_4!),  x_4 = N - x_1 - x_2 - x_3,  \sum p_i = 1  [7]

    \hat{\zeta} = (x_1 - x_2)/N                                              [8]

    \hat{\zeta}_0 = 4x_4/N                                                   [9]

    \hat{\zeta}_1 = 3(x_3 - x_4)/N                                           [10]

    \hat{\zeta}_2 = 2(x_2 - x_3)/N                                           [11]



    p_1 \ge p_2 \ge p_3 \ge p_4                                              [12]

    P = \zeta + 3\zeta_0/4 + 2\zeta_1/3 + \zeta_2/2                          [13]

    \hat{P} = \hat{\zeta} + 3\hat{\zeta}_0/4 + 2\hat{\zeta}_1/3 + \hat{\zeta}_2/2   [14]

    P = \zeta + \sum_{i=2}^{t} p_i                                           [15]

    \sum b_i x_i \le \sum c_i x_i                                            [16]

    \sum_{i=1}^{k} c_i \ge \sum_{i=1}^{k} b_i,   k = 1, \ldots, n-1          [17]

    k = 1 - (1 - P)/B                                                        [18]

B = 



3C o , z h , h 



K n K i K ? 



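    \Delta = P_{\max} - P                                                    [20]

    H(p_1, \ldots, p_r) = -\sum_k p_k \log_e p_k                             [21]

    r_1 = \zeta + \zeta_0/4 + \zeta_1/3 + \zeta_2/2                          [22]

    r_2 = \zeta_0/4 + \zeta_1/3 + \zeta_2/2                                  [23]

    r_3 = \zeta_0/4 + \zeta_1/3                                              [24]

    r_4 = \zeta_0/4                                                          [25]

    \hat{\zeta} = (y_1 - y_2)/n                                              [26]

    \tilde{\theta} = \alpha z + \delta                                       [27]

    v_i = y_i/n                                                              [28]

    w_i = y_i(y_i - 1)/(n(n-1))                                              [29]

    E(v_i | p_1, p_2) = p_i                                                  [30]

    E(w_i | p_1, p_2) = p_i^2                                                [31]

    E(y_1 y_2 | p_1, p_2) = n(n-1) p_1 p_2                                   [32]

    E(v_1 - v_2) = E(\zeta)                                                  [33]

    E(w_1 + w_2 - 2y_1 y_2/(n(n-1))) = E(\zeta^2)                            [34]

    \hat{\mu}_\zeta = N^{-1} \sum_{j=1}^{N} (v_{1j} - v_{2j})                [35]

    \widehat{E(\zeta^2)} = N^{-1} \sum_{j=1}^{N} (w_{1j} + w_{2j} - 2y_{1j} y_{2j}/(n(n-1)))   [36]

    \hat{\sigma}_\zeta^2 = \widehat{E(\zeta^2)} - \hat{\mu}_\zeta^2          [37]

    \tilde{\zeta} = .3435(y_1 - y_2)/n + .276                                [38]

    f(y_1, y_2 | \zeta, p_2) = \frac{n! (\zeta + p_2)^{y_1} p_2^{y_2} (1 - \zeta - 2p_2)^{n - y_1 - y_2}}{y_1! y_2! (n - y_1 - y_2)!}   [39]

    g(\zeta, p_2) = \frac{\Gamma(v_1 + v_2 + v_3)}{\Gamma(v_1)\Gamma(v_2)\Gamma(v_3)} \zeta^{v_1 - 1} p_2^{v_2 - 1} (1 - \zeta - p_2)^{v_3 - 1}   [40]

    v_1(1 - \mu_\zeta^{-1}) + v_2 + v_3 = 0                                  [41]

    v_1 + v_2(1 - \mu_{p_2}^{-1}) + v_3 = 0                                  [42]

    s = v_1 + v_2 + v_3                                                      [43]

    s = \mu_\zeta (1 - \mu_\zeta) \sigma_\zeta^{-2} - 1                      [44]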



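    \begin{pmatrix} 1 - \mu_\zeta^{-1} & 1 & 1 \\ 1 & 1 - \mu_{p_2}^{-1} & 1 \\ 1 & 1 & 1 \end{pmatrix}
    \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ s \end{pmatrix}   [45]

    E(\zeta | y_1) = [p(y_1) B(v_1, v_2, v_3)]^{-1} \binom{n}{y_1} \sum_{w=0}^{y_1} \binom{y_1}{w}
                     B(w + v_2, n - y_1 + v_3) B(y_1 - w + v_1 + 1, n - y_1 + w + v_2 + v_3)      [46]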




TABLE 1

Four Possible Outcomes When an Examinee Attempts an Item

                                     Decision
Latent State    Knows                      Doesn't Know                 Marginal Probabilities
Knows           ζ                          0
Doesn't Know    ζ_0/4 + ζ_1/3 + ζ_2/2      3ζ_0/4 + 2ζ_1/3 + ζ_2/2



TABLE 2

Number of Examinees who are Correct on the
ith Attempt of the Item

                     Attempt
Item       1       2       3       4       5
1        139      14       9       4       2
2        100      42      17       6       3
3         68      34      16      29      21
4         31      93      20      15       9





ACKNOWLEDGEMENTS

The author would like to thank Dr. Scott Fraser for generously
supplying the data used in this study, and to thank Dr. Joan Murray
for helpful comments on an earlier draft of this paper.

The project presented or reported herein was performed pursuant to
a grant from the National Institute of Education, Department of Health,
Education, and Welfare. However, the opinions expressed herein do not
necessarily reflect the position or policy of the National Institute of
Education, and no official endorsement by the National Institute of
Education should be inferred.

Author's Address
Rand R. Wilcox
Department of Psychology
University of Southern California
Los Angeles, CA 90007






A POLARIZATION TEST FOR MAKING INFERENCES
ABOUT THE ENTROPY OF MULTIPLE-CHOICE TEST ITEMS

Rand R. Wilcox

DEPARTMENT OF PSYCHOLOGY
University of Southern California
Los Angeles, California 90007

and the

CENTER FOR THE STUDY OF EVALUATION
Graduate School of Education
University of California, Los Angeles 90024



ABSTRACT 

Under an answer-until-correct scoring procedure, the entropy function
can be used to measure the effectiveness of the distractors of a
multiple-choice test item. This brief note indicates how a polarization test can
be used to determine whether the entropy is large or small. Included as
a special case is an exact test of whether guessing is at random.



1. INTRODUCTION

Consider a specific multiple-choice test item having k alternatives
from which to choose, only one of which is the correct response. Suppose
that a randomly sampled examinee responds to the item according to an
answer-until-correct scoring procedure. This means that the examinee
chooses alternatives until the correct response is identified. This is
usually accomplished by having the examinee erase a shield on an answer
sheet. The examinee knows immediately whether the correct response was
chosen. If it was not, another shield is erased, and this continues
until the correct response is identified. Wilcox (1981a) describes
several measurement problems that this scoring procedure can solve. They
include correcting for guessing without assuming guessing is at random,
testing whether guessing is at random, measuring the effectiveness of
distractors, and estimating the probability of correctly determining
whether an examinee knows the correct response when a conventional
scoring procedure is used. This last probability makes it possible to
characterize n-item tests, and a relevant statistical procedure has been
developed (Wilcox, in press). More recently the results in Wilcox (1981a)
were extended to a strong true score model that allows guessing to vary
over the population of examinees but which does not assume true score
and guessing are independent (Wilcox, 1981b).

Suppose an answer-until-correct scoring procedure is used, and let
$q_j$ be the probability that a randomly selected examinee chooses the
correct response on the jth attempt of the item. Wilcox (1981a) makes
certain assumptions about how examinees behave when attempting a
multiple-choice item which imply that

$$q_1 \ge q_2 \ge \cdots \ge q_k. \tag{1}$$

This assumption was empirically checked with 620 examinees who took three
tests during a semester for a total of 117 items. At the .01 level of
significance, it was found that all but 6 of the items satisfied this
restriction (Wilcox, 1981b).

In Wilcox (1981a), it was proposed that the effectiveness of the
distractors be measured with

$$H(q_1, \ldots, q_k) = -\sum_{i=1}^{k-1} p_i \ln p_i \tag{2}$$

where $p_i = q_{i+1}/(1-q_1)$. This is the entropy function, which is also
known as Shannon's measure of information or diversity. Wilcox (1981a)
notes that if it is decided that an examinee knows the correct response if
and only if the correct response is chosen on the first attempt of the item
(i.e., a conventional scoring procedure is used), the distractors are the
most effective when $q_2 = q_3 = \cdots = q_k$. This corresponds to random
guessing, and Weitzman (1970) calls such items "ideal." The entropy function
measures how far away an item is from being ideal. Small values of H indicate
that guessing is not close to being random, while large values of H mean the
item is close to being ideal. The largest possible value for H is ln(k-1),
and its smallest value is zero.
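As a small illustration of equation 2, the sketch below computes H from a vector of attempt probabilities. The function name `distractor_entropy` and the example probabilities are hypothetical and chosen only for illustration; terms with $p_i = 0$ are assumed to contribute zero.

```python
import math

def distractor_entropy(q):
    """Entropy H of equation 2 for q = (q_1, ..., q_k)."""
    q1 = q[0]
    if q1 >= 1.0:
        raise ValueError("q_1 must be below 1 so that the p_i are defined")
    # p_i = q_{i+1} / (1 - q_1), i = 1, ..., k-1
    p = [qj / (1.0 - q1) for qj in q[1:]]
    # Assumption: cells with p_i = 0 contribute nothing (0 * ln 0 taken as 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Hypothetical item with random guessing among three distractors.
h_ideal = distractor_entropy([0.4, 0.2, 0.2, 0.2])
# Hypothetical item where the last attempt is never needed.
h_poor = distractor_entropy([0.4, 0.3, 0.3, 0.0])
```

For the first item H equals ln(3), the maximum for k=4; for the second it drops to ln(2).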

For a random sample of n examinees, let $x_i$ be the number who choose
the correct response on the ith try. The maximum likelihood estimate of
H is

$$\hat{H} = -\sum_{i=1}^{k-1} \hat{p}_i \ln \hat{p}_i, \qquad \hat{p}_i = \frac{x_{i+1}}{n - x_1}, \tag{3}$$

where the estimate is taken to be ln(k-1) when $n = x_1$ (cf. Gill & Joanes,
1979; Basharin, 1959; Hutcheson & Shenton, 1974).
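A minimal sketch of the plug-in estimate follows, assuming each $p_i$ is replaced by $x_{i+1}/(n - x_1)$ and that empty cells contribute zero. The function name `entropy_mle` and the counts are illustrative only.

```python
import math

def entropy_mle(x):
    """Plug-in estimate of H from counts x = (x_1, ..., x_k)."""
    n, k = sum(x), len(x)
    m = n - x[0]                     # examinees who missed on the first attempt
    if m == 0:
        return math.log(k - 1)       # convention stated in the text for n = x_1
    # Assumption: cells with a zero count contribute nothing to the sum.
    return -sum((xi / m) * math.log(xi / m) for xi in x[1:] if xi > 0)

h_hat = entropy_mle([75, 21, 4])     # estimated p = (.84, .16)
h_all_correct = entropy_mle([10, 0, 0])
```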

The purpose of this note is to indicate how the polarization test
recently proposed by Alam and Mitra (1981) might be extended to make
inferences about H. Interest is focused upon testing the hypothesis

$$H_0: H \ge h$$

where h is a known constant. An important special case is h = ln(k-1), which
corresponds to testing whether guessing is at random. The appeal of the
procedure outlined here is that the exact distribution of a statistic
used by Alam and Mitra, which is described below, can be used to compare
H to h. This is important because asymptotic approximations of the
distribution of $\hat{H}$ tend to be unsatisfactory unless n is very large
(Bowman et al., 1971). Comments by Alam and Mitra (1981) indirectly confirm this.

2. COMMENTS ON H, MAJORIZATION AND SCHUR FUNCTIONS 

When making inferences about H, the natural procedure is to use $\hat{H}$,
which is given in equation 3. However, the exact distribution of $\hat{H}$ is
rather complex and cumbersome to work with (Bowman et al., 1971). Instead the
statistic

$$T(X) = \sum_{i=2}^{k} x_i^2 \tag{4}$$

is used. Note that if T(X) is divided by $(n-x_1)^2$ we get an estimate of
$\sum_{i=1}^{k-1} p_i^2$, which is known as Simpson's measure of diversity
(Simpson, 1949).
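The statistic and its link to Simpson's measure can be sketched as follows. The function names are hypothetical, and the counts are used purely as an example.

```python
def polarization_statistic(x):
    """T(X): sum of squared counts for attempts 2, ..., k."""
    return sum(xi * xi for xi in x[1:])

def simpson_estimate(x):
    """T(X) divided by (n - x_1)^2, an estimate of the sum of p_i^2."""
    m = sum(x) - x[0]
    return polarization_statistic(x) / (m * m)

t_obs = polarization_statistic([75, 21, 4])   # 21**2 + 4**2
simpson = simpson_estimate([75, 21, 4])       # 457 / 25**2
```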






At first glance it might appear that equation 4 is completely unjustified
when making inferences about H, but in terms of majorization and Schur
functions (which are defined below) this is not the case. The goal in this
section is to briefly outline why this is true. Additional clarification
of this point will be made in a later section.

Consider any two vectors $a = (a_1, \ldots, a_k)$ and $b = (b_1, \ldots, b_k)$,
and let $a_{[1]} \ge a_{[2]} \ge \cdots \ge a_{[k]}$ be the components of a
written in descending order. The vector a is said to majorize the vector b,
written $a >_m b$ or $b <_m a$, if

$$\sum_{i=1}^{j} a_{[i]} \ge \sum_{i=1}^{j} b_{[i]}, \qquad j = 1, \ldots, k-1,$$

and

$$\sum_{i=1}^{k} a_{[i]} = \sum_{i=1}^{k} b_{[i]},$$

where $b_{[i]}$ is defined in the same manner as $a_{[i]}$. For example,

$$(1, 0, \ldots, 0) >_m (1/2, 1/2, 0, \ldots, 0) >_m \cdots >_m (1/k, \ldots, 1/k).$$
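The definition translates directly into a partial-sum check. The following is a minimal sketch; the function name `majorizes` and the tolerance argument are assumptions introduced for illustration.

```python
def majorizes(a, b, tol=1e-12):
    """Return True if vector a majorizes vector b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_sorted = sorted(a, reverse=True)
    b_sorted = sorted(b, reverse=True)
    # The totals must agree.
    if abs(sum(a_sorted) - sum(b_sorted)) > tol:
        return False
    # Each partial sum of the ordered a must be at least that of b.
    sum_a = sum_b = 0.0
    for ai, bi in zip(a_sorted[:-1], b_sorted[:-1]):
        sum_a += ai
        sum_b += bi
        if sum_a < sum_b - tol:
            return False
    return True

chain_ok = (majorizes([1, 0, 0], [0.5, 0.5, 0])
            and majorizes([0.5, 0.5, 0], [1 / 3, 1 / 3, 1 / 3]))
```

Majorization is only a partial ordering: for instance, neither of (.6, .2, .2) and (.5, .4, .1) majorizes the other.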

A real valued function $\phi$ is said to be Schur convex if $a >_m b$ implies
that $\phi(a) \ge \phi(b)$. If $a >_m b$ implies $\phi(a) \le \phi(b)$, the
function $\phi$ is Schur concave. In statistics there has been an increasing
interest in Schur functions, and results in Alam and Mitra are formulated in
terms of these concepts. For a recent summary of various results on Schur
functions, see Marshall and Olkin (1979).

To motivate the use of equation 4, first we note that given $x_1$, T is
a Schur convex function of $(x_2, \ldots, x_k)$, and H is Schur concave. This
means that in the sense of majorization, both T and H can be used to measure
the inequality of the $p_i$'s, and indeed both measures are used. To put it
another way, comparing H to h is comparable to comparing p to some known
vector $p_0$, the comparison being made in terms of majorization. In fact,
this is exactly what Alam and Mitra (1981) do in their paper, but they
started with $p_0$ rather than h. As explained in more detail below, it is
possible to start with h, and then formulate the problem in terms of
comparing p to $p_0$. Once this is done, it is possible to make use of the
results given by Alam and Mitra, but as will become evident, certain
modifications of their results will be needed.
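A quick numerical check of the Schur properties may help. In the sketch below the two count vectors are hypothetical; (20, 5, 0) majorizes (10, 8, 7) because the ordered partial sums satisfy 20 ≥ 10 and 25 ≥ 18 while both totals equal 25. The helper names are illustrative only.

```python
import math

def sum_of_squares(v):
    """T-type statistic: sum of squared counts."""
    return sum(vi * vi for vi in v)

def entropy_of_counts(v):
    """Entropy of the observed proportions; zero counts contribute nothing."""
    m = sum(v)
    return -sum((vi / m) * math.log(vi / m) for vi in v if vi > 0)

more_polarized = [20, 5, 0]    # majorizes the vector below
less_polarized = [10, 8, 7]

t_pair = (sum_of_squares(more_polarized), sum_of_squares(less_polarized))
h_pair = (entropy_of_counts(more_polarized), entropy_of_counts(less_polarized))
```

The majorizing vector has the larger sum of squares (Schur convexity) and the smaller entropy (Schur concavity), so the two measures order the vectors in opposite directions.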

3. DETERMINING $p_0$ WHEN h IS GIVEN

Suppose h has been specified. This section outlines how the problem
of comparing H to h might be reformulated in terms of comparing p to some
known vector $p_0$. First note that if h = ln(k-1), which is the maximum
possible value of H(p), comparing H to h is the same as comparing p to
$p_0 = ((k-1)^{-1}, \ldots, (k-1)^{-1})$.

Next let h be any real number between 0 and ln(k-1). Since H is
Schur concave, there is an integer m such that

$$\mathbf{p}_1 = (1/m, \ldots, 1/m, 0, \ldots, 0) >_m \mathbf{p}_2 = (1/(m+1), \ldots, 1/(m+1), 0, \ldots, 0)$$

and $H(\mathbf{p}_1) \le h \le H(\mathbf{p}_2)$, where $\mathbf{p}_1$ has m
elements equal to $m^{-1}$ and $\mathbf{p}_2$ has m+1 elements equal to
$(m+1)^{-1}$. Moreover, for any c such that $0 < c < m^{-1} - (m+1)^{-1}$,

$$\mathbf{p}_3 = (m^{-1}-c, \ldots, m^{-1}-c, mc, 0, \ldots, 0) >_m \mathbf{p}_2$$

and $\mathbf{p}_1 >_m \mathbf{p}_3$. In addition, as c increases,
$\mathbf{p}_3$ decreases in the sense of majorization. Thus, for any h,
0 < h < ln(k-1), it is possible to find a vector $p_0$ such that $H(p_0) = h$.

For example, suppose an item has 4 alternatives. The maximum possible
value for the entropy of the distractors is ln(3) = 1.0986. Suppose we want
to determine whether the distractors have at least 80% of the maximum
possible entropy. This corresponds to comparing H to h = .88. Since
H(1/2, 1/2, 0) = .693, we determine $p_0$, with
$(1/2, 1/2, 0) >_m p_0 >_m (1/3, 1/3, 1/3)$, such that $H(p_0) = .88$. For
vectors of the form $(2^{-1}-c, 2^{-1}-c, 2c)$, c can be determined so that
H(1/2 - c, 1/2 - c, 2c) = .88. The answer is approximately c = 1/32, and so
$p_0 = (15/32, 15/32, 2/32)$. In summary, comparing H to .88 is, in the sense
of majorization, comparable to comparing p to $p_0 = (15/32, 15/32, 2/32)$.
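The search for c can be sketched with a simple bisection. This is one possible way to carry out the step, not necessarily the method used by the author; the function names `entropy`, `p3` and `find_p0` and the tolerance are assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy with natural logs; zero cells contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def p3(m, c, length):
    """Vector (1/m - c, ..., 1/m - c, m*c, 0, ..., 0) with `length` cells."""
    return [1.0 / m - c] * m + [m * c] + [0.0] * (length - m - 1)

def find_p0(h, length, tol=1e-12):
    """Bisection on c so that entropy(p3(m, c, length)) is close to h."""
    if not 0.0 < h < math.log(length):
        raise ValueError("h must lie strictly between 0 and ln(length)")
    m = int(math.exp(h) + 1e-12)        # integer with ln(m) <= h < ln(m + 1)
    lo, hi = 0.0, 1.0 / m - 1.0 / (m + 1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # Entropy does not decrease as c grows, so bisection applies.
        if entropy(p3(m, mid, length)) < h:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    return c, p3(m, c, length)

c, p0 = find_p0(0.88, 3)   # four alternatives, so three distractors
```

For h = .88 the solution is c of roughly .03, close to the value 1/32 used in the text.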

4. THE POLARIZATION TEST 




The point of the previous section is that the problem of comparing H
to h, or comparing any measure of diversity to a known constant, can be
reformulated in terms of comparing an unknown vector to a known vector in
the sense of majorization. This can be done if the measure of diversity
is a Schur function. This section considers how p might be compared to
$p_0$ once $p_0$ is determined.



The Distribution of T 

The first step in devising a method of comparing p to $p_0$ is to derive
the exact distribution of T. First, however, we will need the distribution
of

$$S(X) = \sum_{i=1}^{k} x_i^2 \tag{5}$$

where $X = (x_1, \ldots, x_k)$.



We note that expression (2.1) in Alam and Mitra (1981) is supposed to be
the distribution of S(X) for k=2. However, the maximum possible value of
S is $n^2$, not n, and so the inequalities in their expression (2.1) are
incorrect. Another problem is that the smallest possible value of S is
$n^2/2$ if n is even, and $(n^2+1)/2$ if n is odd; it is not n/2 as implied
by Alam and Mitra's equation (2.1). The same mistakes are made in
expression (2.2), and their expression (2.2) contains two other typographical
errors. However, even if these corrections are made, the limits on the
summation in their expression (2.1) are incorrect. As a simple example,
suppose n=3. Then

$$\Pr(S(X) \le 5) = 3q_1(1-q_1)^2 + 3q_1^2(1-q_1),$$

which does not agree with their results. Accordingly, the exact distribution
of S(X) is derived here.

First consider k=2, and let $c_y = 0$ if y = n/2; otherwise $c_y = 1$. Let a
be the smallest integer greater than or equal to n/2, and let b be the
largest integer less than or equal to z - n/2, where z is the largest
integer such that $z(n-z) \ge (n^2 - s)/2$. Then

$$\Pr(S(X) \le s) = \sum_{y=a}^{a+b} \left[ \binom{n}{y} q_1^{y} (1-q_1)^{n-y} + c_y \binom{n}{n-y} q_1^{n-y} (1-q_1)^{y} \right]. \tag{6}$$
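The k=2 result can be checked numerically. The sketch below evaluates the sum over y from a to z (which equals a + b) with the mirror term dropped when y = n/2, and compares it with direct enumeration of the binomial distribution. The function names are hypothetical, and the convention of returning zero when no integer z exists is an assumption.

```python
from math import comb

def pr_s_le_formula(s, n, q):
    """Pr(x_1^2 + x_2^2 <= s) for x_1 ~ Binomial(n, q) and x_2 = n - x_1."""
    a = (n + 1) // 2                      # smallest integer >= n/2
    # z: largest integer with z(n - z) >= (n^2 - s)/2, if one exists.
    candidates = [y for y in range(n + 1) if 2 * y * (n - y) >= n * n - s]
    if not candidates:
        return 0.0                        # assumption: s below the minimum of S
    z = max(candidates)
    total = 0.0
    for y in range(a, z + 1):             # a + b equals z
        total += comb(n, y) * q**y * (1 - q)**(n - y)
        if 2 * y != n:                    # c_y = 1 unless y = n/2
            total += comb(n, n - y) * q**(n - y) * (1 - q)**y
    return total

def pr_s_le_bruteforce(s, n, q):
    """Same probability by summing the binomial pmf over qualifying outcomes."""
    return sum(comb(n, y) * q**y * (1 - q)**(n - y)
               for y in range(n + 1) if y * y + (n - y)**2 <= s)

example = pr_s_le_formula(5, 3, 0.3)      # 3q(1-q)^2 + 3q^2(1-q) with q = .3
```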



Next consider k=3. Given $x_1$, the joint probability function of $x_2$ and
$x_3$ is binomial with parameters $p_1 = q_2/(1-q_1)$, $p_2 = q_3/(1-q_1)$ and
$n - x_1$, and so

$$\Pr(x_2^2 + x_3^2 \le s \mid x_1) = \sum_{y=a}^{a+b} \left[ \binom{n-x_1}{y} p_1^{y} p_2^{\,n-x_1-y} + c_y \binom{n-x_1}{n-x_1-y} p_1^{\,n-x_1-y} p_2^{y} \right]$$

where $n - x_1$ replaces n in the definition of $c_y$, a, b and z. Let
$D_{k-1}(s, x_1)$ represent the right-hand side of this last equality, where
$D_{k-1}(s, x_1) = 1$ if $s \ge (n-x_1)^2$. If $n - x_1$ is even,
$D_{k-1}(s, x_1)$ equals zero if $s < (n-x_1)^2/2$, and if $n - x_1$ is odd it
is zero if $s < ((n-x_1)^2+1)/2$. It follows that

$$\Pr(S(X) \le s) = \sum_{x_1=0}^{n} D_{k-1}(s - x_1^2, x_1) \binom{n}{x_1} q_1^{x_1} (1-q_1)^{n-x_1}. \tag{7}$$

For k>3, the distribution of S(X) can be obtained recursively in the same
manner.
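The recursive idea can be sketched generically: condition on the first count, which is binomial, and apply the same calculation to the remaining cells with renormalized probabilities, the reduced sample size, and the bound lowered by the square of the first count. This is a minimal sketch under those assumptions; the function name `pr_sumsq_le` is hypothetical.

```python
from math import comb

def pr_sumsq_le(s, probs, n):
    """Pr(sum of squared multinomial counts <= s) for cell probabilities probs."""
    if s < 0:
        return 0.0
    if len(probs) == 1:
        # A single cell holds all n observations.
        return 1.0 if n * n <= s else 0.0
    q1, rest = probs[0], probs[1:]
    tail = sum(rest)                       # equals 1 - q1 when probs sum to 1
    if tail <= 0.0:
        return 1.0 if n * n <= s else 0.0
    cond = [r / tail for r in rest]        # conditional cell probabilities
    total = 0.0
    for x1 in range(n + 1):
        weight = comb(n, x1) * q1**x1 * tail**(n - x1)   # binomial pmf of x_1
        if weight > 0.0:
            total += weight * pr_sumsq_le(s - x1 * x1, cond, n - x1)
    return total

two_cell = pr_sumsq_le(5, [0.3, 0.7], 3)
three_cell = pr_sumsq_le(14, [0.5, 0.3, 0.2], 6)
```

With two cells, n=3 and s=5 the function returns .63 for q = .3, which agrees with the direct calculation for k=2.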

Having established the distribution of S(X), it is now possible to
test the hypothesis $H \ge h$ which, via majorization, is comparable to testing

$$H_0: p <_m p_0.$$

Let $B_k(s; q_1, \ldots, q_k, n) = \Pr(S(X) \le s)$. In terms of the $x_i$'s,
the decision rule is to reject $H_0$ if

$$\sum_{i=2}^{k} x_i^2 > t$$

where t is to be determined. From equation 7,

$$\Pr\left( \sum_{i=2}^{k} x_i^2 \le t \,\Big|\, x_1 \right) = B_{k-1}(t; p_1, \ldots, p_{k-1}, n - x_1) \tag{8}$$

where, as before, $p_i = q_{i+1}/(1-q_1)$.

Since under the null hypothesis $p = p_0$, equation 8 can be evaluated,
and so for any observed $x_1$, the probability of a type I error can be
determined for any t.
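One way to choose t is sketched below for k=3: tabulate the conditional distribution of the sum of squares given $x_1$ under $p = p_0$, then take the smallest attainable t whose upper tail probability does not exceed a nominal level. The function name `critical_value`, the level .05 and the inputs m=25 and $p_0$ = (.8, .2) are assumptions made for illustration.

```python
from math import comb

def critical_value(m, p0, alpha):
    """Smallest attainable t with Pr(T > t | x_1) <= alpha when k = 3.

    m is n - x_1 and p0 = (p_1, p_2) is the null vector.
    Returns the pair (t, attained tail probability).
    """
    # Conditional pmf of T = y^2 + (m - y)^2 with y ~ Binomial(m, p_1).
    pmf = {}
    for y in range(m + 1):
        t = y * y + (m - y) ** 2
        pmf[t] = pmf.get(t, 0.0) + comb(m, y) * p0[0]**y * p0[1]**(m - y)
    tail = 1.0
    for t in sorted(pmf):
        tail -= pmf[t]                    # tail is now Pr(T > t | x_1)
        if tail <= alpha:
            return t, max(tail, 0.0)

t_crit, size = critical_value(25, (0.8, 0.2), 0.05)
```

Because T is discrete, the attained type I error probability is generally below the nominal level.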

An Illustration 

As a simple illustration, suppose k=3 and it is decided to test

$$H_0: H \ge .5.$$

From section 3, $p_0$ is approximately (.8, .2) which, in the sense of
majorization, is the same as (.2, .8). Thus, in terms of the polarization
test,

$$H_0: p <_m (.8, .2).$$

Suppose n=100 examinees are randomly sampled and that $x_1 = 75$, $x_2 = 21$
and $x_3 = 4$ are observed. Then $\sum_{i=2}^{3} x_i^2 = 457$. Setting
p = (.8, .2), equation 8 yields the value of $\Pr(T(X) \le 425 \mid x_1 = 75)$
which in turn gives the value of $\Pr(T(X) > 425 \mid x_1 = 75)$ (425 is the
largest attainable value of T below 457, so this is the probability of a
value at least as large as the one observed). Using the tables in Pearson
(1968), the latter value was found to be .4206, and so the observed $x_i$'s
are reasonably consistent with the null hypothesis. If instead $x_2 = 24$ and
$x_3 = 1$, $\Pr(T(X) > 577 \mid x_1 = 75) = .023$, and so the results would be
significant at the .05 level.
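The first calculation in the illustration can be reproduced without tables. The sketch below assumes k=3 and sums the conditional binomial probabilities of all outcomes whose sum of squares is at least the observed value; the function name `conditional_tail` is hypothetical.

```python
from math import comb

def conditional_tail(x, p0):
    """Pr(T(X) >= observed T | x_1) for k = 3 when p = p0 = (p_1, p_2)."""
    m = sum(x) - x[0]                     # n - x_1
    t_obs = x[1] ** 2 + x[2] ** 2
    # Given x_1, the count x_2 is Binomial(m, p_1) and x_3 = m - x_2.
    return sum(comb(m, y) * p0[0]**y * p0[1]**(m - y)
               for y in range(m + 1)
               if y * y + (m - y) ** 2 >= t_obs)

tail = conditional_tail([75, 21, 4], (0.8, 0.2))
```

For the first data set this gives a tail probability of about .42, in line with the tabled value quoted in the text.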

An optimal property of the test. A desirable property of any hypothesis
testing procedure is that as the unknown parameters move away from
the null hypothesis, the power of the test increases. Here this means
that if p' and p'' are any two vectors such that $p' >_m p'' >_m p_0$, we want
the power of the test of $p <_m p_0$ to be larger at p = p' than it is at
p = p''. That this property holds follows immediately from a theorem in
Marshall and Olkin (1979, p. 391). Thus, we have an additional justification
for using the polarization test as it is outlined above.

SUMMARY

In summary, the paper describes how hypotheses about the effectiveness
of the distractors of multiple choice test items might be tested. Included
as a special case is an exact test for random guessing that can be used in
conjunction with an answer-until-correct scoring procedure. This is in
contrast to the asymptotic test for random guessing (which does not use
an answer-until-correct scoring rule) that was proposed by Weitzman (1970).

Another point is that it is not being recommended that an item be
modified if $H_0$ is rejected. Wilcox (1981a) describes how the accuracy of
a test item can be estimated. If the accuracy is high, there may be little
reason for trying to improve the distractors by ensuring [...].
The reason is that any improvements in the distractors might yield a
negligible increase in item accuracy. However, if its accuracy is moderate
or small, and if $H_0$ is rejected, consideration might be given to improving
the distractors.





REFERENCES

Alam, K., & Mitra, A. Polarization test for the multinomial distribution.
Journal of the American Statistical Association, 1981, 76, 107-109.

Basharin, G. On a statistical estimate for the entropy of a sequence of
independent random variables. Theory of Probability and its Applications,
1959, 4, 333-336.

Bowman, K., Hutcheson, K., Odum, E., & Shenton, L. Comments on the
distribution of indices of diversity. In G. Patil, E. Pielou, and W. Waters
(Eds.), International Symposium on Statistical Ecology, Vol. 3.
University Park: Pennsylvania State Press, 1971.

Gill, C., & Joanes, D. Bayesian estimation of Shannon's index of
diversity. Biometrika, 1979, 66, 81-85.

Hutcheson, K., & Shenton, L. Some moments of an estimate of Shannon's
measure of information. Communications in Statistics, 1974, 3, 89-94.

Marshall, A., & Olkin, I. Inequalities: Theory of majorization and its
applications. New York: Academic Press, 1979.

Simpson, E. Measurement of diversity. Nature, 1949, 163, 688.

Weitzman, R. A. Ideal multiple-choice items. Journal of the American
Statistical Association, 1970, 65, 71-89.

Wilcox, R. R. Solving measurement problems with an answer-until-correct
scoring procedure. Applied Psychological Measurement, 1981, 5,
in press (a).

Wilcox, R. R. Using results on k out of n system reliability to study and
characterize tests. Educational and Psychological Measurement, in press.

Wilcox, R. R. Some empirical and theoretical results on an answer-until-
correct scoring procedure. British Journal of Mathematical and Statistical
Psychology, 1981, submitted for publication (b).




